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Foreword 


Timo Honkela 


Do historians need computers for more than writing? What could be the other 
uses? Does the use of computers pose a threat to history by bringing in a reduc- 
tionist approach? Perhaps surprisingly to some, I would say that historians are 
dealing with phenomena that may be characterised as much more complex than 
those studied by scholars in natural sciences. This has meant that historians 
have had to work using educated human interpretation rather than machines 
or algorithms as their methodological basis. In natural sciences, it is common- 
place to look for underlying simple causes even for phenomena that appear to 
be highly complex on the surface. Such a reductionist approach is neither wise 
nor productive in many areas and topics within history, because human behav- 
iour and social organisation includes complexities like phenomena’ instabil- 
ity as they change over time. The reliance on words of written languages and 
human-made images and less on numbers based on measurements, and the 
non-linear dependence of context and complex feedback mechanisms between 
society, individuals and their times further increase the intricacy of a histori- 
cal study. An example of the latter is that a written document on history may 
itself influence the track of future events and thus history can change history. 
These kinds of connections are not known to exist within the natural sciences. 
History is different. 

I have for a long time followed, participated in and had aspirations for con- 
necting history and computer science. In the 1980s, I as a computer scientist 


How to cite this book chapter: 

Honkela, T. (2020). Foreword. In M. Fridlund, M. Oiva, & P. Paju (Eds.), Digital 
histories: Emergent approaches within the new digital history (pp. xvii-xix). Helsinki: 
Helsinki University Press. https://doi.org/10.33134/HUP-5-19 


xviii Digital Histories 


was using rule-based artificial intelligence (AI) methods in particular in the 
area of natural language processing. Thanks to my personal contacts, I had 
often a chance to discuss with historians their work, results and methodologies. 
Soon the idea came up of finding ways to build a bridge between historians 
and AI researchers, but after some consideration this did not seem feasible or 
relevant. With the rule-based AI methods, it was not possible to approach phe- 
nomena that were relevant for historians. Computer science was not sufficiently 
developed for historians. The challenge was both quantitative and qualitative. 
However, as history tells us, and this volume gives many examples of, our tools 
and our times change. Since the 1990s, my experience of using and developing 
neural networks and statistical machine learning methods has been quite dif- 
ferent. My personal experience of applying these methods since 1991 led to the 
conclusion that important opportunities were available, in which the expertise 
of the historians was a central asset. In the 2000s, I entered into a joint research 
project using neural networks on digital history with a historian that succeeded 
in producing results of mutual interest to me, the computer scientist, as well as 
to the historian. Computer science had caught up with history. 

Today, the situation to some degree has been reversed, in that historians have 
started to seek out the advanced methodologies of computer science. Histori- 
ans, like all researchers in the humanities and social sciences, wish to work in 
an analytically and methodologically solid manner. Occasionally, the success of 
natural sciences has led historians to find research questions where the borrow- 
ing of methods of natural sciences would be suitable and sufficient. This might 
be for wishes to achieve a wider generalisability and predictive power may be 
sought. However, in many cases, this leads to reductionism that prevents the 
results to be relevant within the complex world of humans as individuals and 
as social constellations. The challenges of how to account for symbol function, 
human intentionality and the role of artifacts are just some of the many fac- 
tors that still render history a very challenging field for computer science. But 
the development within computational analysis, modelling and visualisation 
methods and tools are changing this situation, which is exemplified by the 
research in this volume. 

The current research illustrates the state-of-the-art nature of this col- 
laboration and when looking even further ahead there are new challenging 
opportunities related to history that stem from possible collaboration among 
history, other disciplines in humanities and computer science. From the point 
of view of cognitive linguistics, the meaning of meaning could be studied more 
carefully than what is usually done within the study of history. The meaning 
of linguistic expressions is dependent on the historical, societal and linguis- 
tic context in which they were written. Qualitative nuances can further be 
obtained by studying how items such as words, names, events, periods of time, 
individual persons or institutions are related to one other through some data 
than can be studied using statistical machine learning methods. These results 
can further be studied using the historian's expert considerations. Although 
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this line of research is already being conducted today, the studies are usually 
limited by focusing on time series or so-called matrix data. 

If one would try to predict what the future might offer for the next generation 
of digital historians, one mathematical concept in particular could be consid- 
ered relevant for historians, that of the tensor. In order to explain what a tensor 
is about, one can consider Excel spreadsheets. As a mathematical structure, a 
matrix is like an Excel spreadsheet. A kind of plate of numbers on a table. A 
tensor is an extension of this structure. In the case of a tensor, there are sev- 
eral layers on top of each other. For instance, in the traditional matrix (‘on the 
table’), the number of people in the different neighbouring areas may be stored 
as numbers. In a tensor, layers of different years may be added so that the num- 
ber of people in the different years can be conveniently stored and analysed. 
Tensors are essential extended data structures that could be highly useful in, 
for example, studying conceptual change in history, varieties of interpretations 
of the same word or item by different people, the development of some phe- 
nomenon over time within a complex context, or filling in gaps in history with 
hypothetical data to be searched for. This potential of the tensor remains for 
future historians and computer scientists to realise. Naturally, it is not the only 
promising direction from the methodological point of view. 

The works in this collection provide a view on how history can be simultane- 
ously studied with analytical rigour and without the need to straightforwardly 
accept the need of reductionism. The developments of studying human behav- 
iour, culture and history with computational modelling, data science and com- 
plexity science and thus to engage with and better understand the new tools of 
our digital times is also a promise of an increasingly better ability and central 
societal role for history to help us understand our digital present and future. 
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Digital and Distant Histories 


Emergent Approaches within the 
New Digital History 


Petri Paju, Mila Oiva and Mats Fridlund 


Half a century ago, historian Emmanuel Le Roy Ladurie, when surveying the 
progress of quantitative history, prophesised that ‘tomorrow’s historian will 
have to be able to programme a computer in order to survive.! Since then, com- 
puters and programming have indeed profoundly changed historians’ practice 
through such digital tools as word processing, the internet, email, PowerPoint, 
Google, JSTOR, Facebook, Twitter and Zoom. They have made all of us histori- 
ans into digital historians in one way or another. As these digital tools used by 
most historians illustrate, there are many ways that the digital has transformed. 
the historian’s craft beyond mere practical and administrative improvements. 
During the new millennium, the computer together with the internet have 
begun to change also the historian’s research tools and methods in new and 
previously unforeseen ways into a novel kind of digital history. It is this new 
emerging digital history, together with some ever-significant approaches of the 
‘old’ quantitative digital history, that is the subject of this book. 

Digital history encompasses diverse historical practices, such as digitisa- 
tion efforts at archives, libraries and museums, computer-assisted research, 
web-based teaching and professional and public dissemination of historical 
knowledge, as well as research on the history of ‘the digital; computers and 
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digital technologies. One comprehensive definition capturing this diversity of 
practices was suggested more than a decade ago in a discussion between digital 
historians in the Journal of American History: 


Digital history is an approach to examining and representing the past 
that works with the new communication technologies of the computer, 
the Internet network, and software systems. On one level, digital history 
is an open arena of scholarly production and communication, encom- 
passing the development of new course materials and scholarly data 
collections. On another, it is a methodological approach framed by the 
hypertextual power of these technologies to make, define, query, and 
annotate associations in the human record of the past. To do digital 
history, then, is to create a framework, an ontology, through the tech- 
nology for people to experience, read, and follow an argument about a 
historical problem.’ 


While the digital embraces the whole spectrum of the historian's craft, this vol- 
ume focuses on digital history as a form of scholarly research that uses digital 
sources and tools to produce new historical knowledge. This form of digital his- 
tory research is part of the larger digital turn in academia, identified as digital 
humanities, culture analytics, computational social sciences and other concepts 
related to utilisation of computer-assisted methods for research.’ By bringing 
together research contributions to the new digital history from historians, 
computer scientists, computational linguists and other scholars producing new 
empirical historical knowledge using digital methodologies, as well as con- 
ceptually focused perspectives on critical issues of the field's past, present and 
future development, this book provides digital histories that we hope will be 
read as laudable exemplars from within the emergent digital history research 
community. The digital histories collected here simultaneously represent vari- 
ous methodological applications of and themes within digital history research 
and thus an attempt to take stock of current research rather than providing a 
pedagogical textbook or programmatic manifesto. The new digital history has 
matured enough for us to instead be able to present historical work currently 
furthering historical research. Thus, the studies in this book take digital history 
beyond discussions of its future potential, proofs of concept and pedagogical 
examples to instead focus on digital history ‘in action, to the making of new 
historical knowledge. 

Through this focus on presenting results from digital history research pro- 
jects, this book breaks new ground within the current wave of digital history. 
Other digital history books published so far have mainly been monographs 
focused on discussing how historians could use digital sources or methods to 
conduct and present research such as the pioneering Digital history: a guide 
to gathering, preserving, and presenting the past on the web (2006) by Daniel 
J. Cohen and Roy Rosenzweig, or anthologies such as History in the digital 
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age (2013) contributing discussions of the problems and possibilities of doing 
the new digital history, rather than research results of historical studies using 
new digital methodologies.* This is as expected, as it is only during the last 
couple of years that we have seen the first research publications using digital 
history research methodologies within mainstream academic historical pub- 
lishing outlets. Matthew Jockers’ Macroanalysis (2013) appears to be the first 
research monograph published by a university press, and Cameron Blevins’ 
‘Space, nation, and the triumph of region’ (2014) is the first peer-reviewed 
research article published in the Journal of American History. In this way, this 
book aspires to pioneer and promote work within the new digital history by 
being a timely research anthology from the current third generation of digital 
historians that,* outside of digital spatial history,’ focuses on contributing new 
historical knowledge from research using digital research methodologies. 


Emergence of the New Digital History 


The roots of exploiting modern data-processing equipment in humanities 
research date back at least to the 1950s, when Josephine Miles started using punch 
cards for literary analysis.* The development was continued in the 1950s with 
Father Roberto Busa utilising IBM mainframe computers and John W. Ellison 
using the UNIVAC I to produce lexical concordances.? Since then, computer- 
assisted history research has produced three ‘generations, roughly following the 
advancement of computers and internet technologies. Simultaneously, there are 
continuities of parallel developments borrowed from, or developed together 
with, sister disciplines, such as text analysis in literary studies, statistical anal- 
yses in economic and social history, Geographic Information System (GIS) 
within geography, and digital image analysis in art history and visual studies. 
Allegedly, ‘the first published work by an historian involving actual computer- 
ized research came in 1963 with a ‘scalogram analysis of voting patterns’ in the 
British Parliament in the 1840s by William Aydelotte at the State University of 
Iowa." A few years later, Viljo Rasila (Paju, this volume) did somewhat similar 
work in Finland. The first larger and more widespread application of computers 
was by the cliometricians of the 1960s, who were recognised as constituting the 
first generation of digital historians. They were followed by a second generation 
centred around the new ‘personal’ computers in the 1990s and were often seen 
as a part of the wider humanistic research field of 'humanities computing: 

The current third generation of digital history can be said to begin to emerge 
in the late 1990s and the early 2000s with the appearance of the first large digit- 
ised full text databases, such as Early English Books Online (EEBO) and Project 
Gutenberg," and with the rebranding and expansion of humanities computing 
to digital humanities in the mid-2000s. Since the early 2000s, contemporary 
historians’ toolkit has been expanded by an increasing volume of digitised 
sources and the swift development of computational analysis methods. This 
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was taking place at the same time as geographical history was going through a 
development from historical GIS (HGIS) to spatial history. 

The snowballing growth in the amount of digital sources and the development 
of new research approaches and concepts has gradually increased the number of 
humanists using computational methods. One of the most frequently used con- 
cepts is distant reading, a perspective pioneered by the literary historian Franco 
Moretti. Distant reading can be understood as a counterpart to close reading 
that has been used extensively in humanities for distilling meanings from texts 
from the 1970s onwards. Distant reading has been used to extract meaning- 
ful patterns from textual sources, particularly when the number of texts are 
so numerous that it is impossible for a human to read them in a consistent 
manner." The examples in this volume show that distant reading can also bea 
useful approach for exploring smaller amounts of text, as it provides another 
kind of approach to the texts in focus. Such machine or algorithmic reading 
provides ‘another pair of scholarly glasses’ and allows examining the sources 
from new perspectives. In the best case, close and distant readings complement 
each other. 

Characteristic for this potentially paradigmatic digital history (Fridlund, this 
volume) is not just the introduction of new conceptualisations, such as ‘distant 
reading, ‘macroanalysis’ or ‘algorithmic reading, or the application of method- 
ological tools such as topic modelling, but also the utilisation of novel practices 
for historians, new digitally augmented ways of working. Digital research 
brings along the collaborations of larger multidisciplinary group projects, the 
use of centralised technical infrastructures and machines. The changes that are 
taking place in history today are in several ways reminiscent of the changes 
that natural science disciplines such as physics and biology went through ear- 
lier with changeover from individuals’ ‘small science’ tabletop experiments to 
interdisciplinary large team ‘big science’ collaborations. 

The origin of this volume lays in an initiative to strengthen digital history 
research proposed by a collective of historians in Finland in 2015. That ambition 
was generously funded by the Kone Foundation through two interconnected 
projects 2016-2018, which brought together the majority of the authors in this 
volume. The first project, Towards a Roadmap for Digital History in Finland, 
aimed at identifying practical, professional and institutional obstacles and 
possibilities for developing digital history research. The second project, From 
Roadmap to Roadshow, built on the first one by bringing together digital his- 
torians to shape the best practices for disseminating knowledge about digital 
methods to historians so that in the end these would facilitate new digital his- 
tory research. This was accomplished through a road tour to six major Finnish 
research universities, where the project organised presentations and workshops 
on emerging research and methodological developments within digital history. 

Originally, the aim of the project was to end after the roadshow and to con- 
clude with the subsequent publication of articles by the three main project 
researchers. However, the enthusiasm among the participants at the various 
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universities promised new digital research results in a not-so-distant future. 
Thus, towards the end of their shared work, the team decided to extend 
the project towards its logical conclusion by organising a workshop during the 
spring of 2018 that invited historians and specialists in digital research meth- 
ods to come together to work collaboratively on formulating and answering a 
number of specific and concrete historical research questions. The historians 
who responded to the invitation brought their source materials and historical 
research questions, while the digital specialists contributed with their meth- 
odological expertise to jointly find answers to the research questions. At the 
workshop, the research teams analysed the sources to come up with prelimi- 
nary solutions and answers and afterwards the teams were encouraged to keep 
working on their projects, and in this book, several of those projects are now 
brought to completion in the form of peer-reviewed research articles. They are 
complemented by articles from other digital historians, presenting results from 
a selection of the other recent research projects. 

The majority of the research presented here is by digital historians active at 
Finnish universities. The rationale behind this is, in addition to the books’ spe- 
cific historical origins as explained above, that the emerging Finnish digital 
history community is both a representative and in many ways exemplary part 
of the larger international development of digital history. It is representative in 
that the used methodological research approaches correspond to the predomi- 
nant directions of current digital history and thus the diversity and breadth 
of the studies presented in this volume, representing digital history research 
in a wide range of topics, from diverse disciplines from political, economic, 
cultural, intellectual and feminist history to history of science and technology 
and periods going from the Early Modern to the recent past. Taken together 
they provide a representative overview of the state-of-the-art of not just 
Finnish digital history research, but also of emerging digital history overall. 
Like most other research communities, the digital history landscape in Finland 
is diverse and dispersed, including bigger research groups, individual research- 
ers and interdisciplinary and collaborative projects with national and foreign 
colleagues in Finland and abroad. This volume is exemplary in that digital his- 
tory in Finland as a community and practice can be said to be more developed 
and institutionalised than in many other countries. In addition to several digi- 
tal historians working at all levels of academic seniority, there are designated 
doctoral positions and professorships, textbooks, a regular digital history con- 
ference series and seminars and a digital history section within the national 
historical society. Compared to most other countries, the stage of digitisation 
of newspapers and archival documents is very advanced, which encourages 
digital history research. The common understanding of digital historians in 
Finland is that the focus of digital history research should be in finding answers 
to the research questions rather than utilisation of digital research tools just for 
their own sake. The contributions to this book, we feel, exemplify that critical 
evaluation of digital sources, metadata and research methods, and the results 
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they provide, are the basic components of good digital history research. Thus, 
the Finnish digital history research, together with the other contributions pre- 
sented in this volume, should be a good representation of some of the most 
widely shared research practices emerging within the new digital history. 


The New Digital and Distant Histories 


This book contributes to advancing the field of history in primarily two ways, 
through new conceptual explorations of the past, present and future of digi- 
tal history research and with new empirical historical knowledge coming out 
of research using digital methods. Through this, we aim to illuminate the 
new digital history’s potential and pitfalls. We have divided the book into 
four parts. Part I “The Beginning’ consists of this introduction. Part II ‘Mak- 
ing Sense of Digital History’ starts with discussing the historical and meth- 
odological roots of digital history and contributes conceptual and contextual 
explorations of the current state of digital histories. Part III ‘Distant Reading, 
Public Discussions and Movements in the Past’ presents empirical case studies 
from various time periods that through the application of digital tools, primar- 
ily various forms of distant reading methodologies, demonstrate the further 
potential for expanding historical knowledge. The final Part IV ‘Conclusions’ 
draws the volume to an end by an exposition of the landscape of digital history 
and its future potential. 

In the foreword, the late computer scientist and pioneering digital humanist 
Timo Honkela, draws on his wide experience of multidisciplinary cooperation 
using computational tools, to offer his thoughts on the digital future of history. 
In Chapter 2, providing a longer historical context for the new digital history, 
Petri Paju examines the history of computer-assisted history research from the 
1960s until the 2010s. By focusing on one particular national development, that 
of historians’ use of computers in Finland, he recognises how, although a par- 
ticular national story, it was part of a larger, international and transnational 
pattern of development within digital history research. 

After the overview of the roots of digital history, the subsequent chapters in 
Part II shed light on the fundamental components of digital history research: 
data, metadata and the mundane, often manual, work enabling the operation 
of our digital tools and resources. In Chapter 3, Jari Eloranta, Pasi Nevalainen 
and Jari Ojala exemplify how economic and business historians in many ways 
have been forerunners of digital history with computerised analysis of numeri- 
cal and event code databases. They also share their experiences of the chal- 
lenges to historical research of digitisation and uses of databases. Chapter 4 
by Mats Fridlund attempts to conceptualise emerging historical practice by 
exploring the present state of digital history research according to two ideal 
types of digital history. Following Thomas Kuhn's theory of scientific revolu- 
tions, he describes them as ‘normal and ‘paradigmatic’ digital history. Further, 
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as a middle way between the two, he proposes one that is beyond the normal 
but still a less revolutionary form of semi-automatic digital history, described 
as digital history 1.5° 

This is followed by Chapter 5, which concerns research infrastructures, where 
Jessica Parland-von Essen calls for better data management and increasing the 
openness of data. She presents the FAIR (Findable, Accessible, Interoperable 
and Re-usable) approach to data, which would not only improve the efficiency, 
but also increase the trustworthiness and quality of historical research. The 
critical theme of the role of metadata in digital history research is taken up 
in Chapter 6 by Kimmo Elo, who points out that when focusing on data, we 
often neglect metadata, although it is a crucial part of the whole. In his chapter, 
he explores ways of improving the quality of the historian’s metadata. Follow- 
ing this is a valuable reminder offered by Johan Jarlbrink in Chapter 7 on the 
importance of manual work to digital machine processing. In his chapter, he 
shows how digital research is far from automated, and that it actually requires 
countless hours of manual work which most of the time stays invisible and thus 
its problems and possibilities are often unnoticed and neglected. 

The subsequent chapters offer a wide array of empirical case studies using 
a selection of digital research methods that exemplify how they can help us 
to reach for new understandings of the past. Beginning this series, in Chapter 8, 
Mirkka Danielsbacka, Lauri Aho, Robert Lynch, Jenni Pettay, Virpi Lummaa 
and John Loehr use statistical quantitative analysis to explore migration 
of Finnish individuals in the 20th century. Using a database that they have 
digitised and complemented with other historical data, they explore socio- 
demographic and environmental factors that can be combined with the domes- 
tic relocation and settlement of migrants. In Chapter 9, Heidi Kurvinen, in the 
vein of feminist history methodology, uses her personal experience of getting 
acquainted with historical text mining to explore traditional and not so tradi- 
tional historians’ experiences in encountering the new digital history meth- 
ods. She notes that entering the field of digital history ‘requires cultural and 
technological capital which marginalises researchers who do not have the skills 
to conduct digital analyses by themselves or do not have access to the organi- 
sational support: Among the factors influencing the ability of researchers to 
participate, she identifies their gender. The next case study by Maiju Kannisto 
and Pekka Kauppinen in Chapter 10 illustrates the use of Named Entity Rec- 
ognition (NER) to explore Finnish audio-visual history as it is presented in 
the public radio and television online archives. Their metadata analysis reveals 
interesting peculiarities in what kind of audio-visual imaginary of the past is 
provided by the dataset, and which elements of the national history it hides. 
In Chapter 11, Matti La Mela gives an excellent example of the opportunities of 
text analysis by tracing the history of the concept of allemansrätten (freedom 
to roam) in the Finnish parliamentary debates and argues counterintuitively to 
common knowledge that the present understanding of the concept has a 
surprisingly short history. His article also takes extra care in making the 
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methodological steps transparent to readers. In Chapter 12, Pasi Ihalainen, 
with the assistance of Aleksi Sahala, uses collocation analysis to study changes 
of the concept of ‘internationalism’ in 20th-century British parliamentary 
debates. By reconstructing the meanings attached to foreign political issues 
in the British Parliament from the early 19th century, they show that the ‘inter- 
national has been associated in different ways during the various deliberations 
on the United Kingdoms membership in international organisations. 

In Chapter 13, Melanie Conroy and Kimmo Elo, with the help of network 
analysis of the metadata of a large picture archive, explore the structure and 
temporal dynamics of the geospatial social networks of the East German 
opposition movement. They show how the network method can be used for 
exploring and visualising, as well as analysing, quantitative historical data. 
Reetta Sippolas contribution in Chapter 14 uses topic modelling to explore the 
evolution of the scientific discourse in the pioneering British scientific journal 
Philosophical Transactions in the mid-18th century. In her study, the method 
of arranging the data makes topic modelling reveal previously neglected 
themes and unnoticed temporal changes in the discourse. Heidi Hakkarainen 
and Zuhair Iftikhar also use this methodology in Chapter 15, in the expanded 
form of dynamic topic modelling, to focus on the formation of the concept of 
‘humanism in the early 19th-century German-language press. They show how 
reaching reliable analysis results demands a deep understanding of the context, 
skills and time, but how the method has the potential to challenge established 
patterns of thought and underlying presumptions by providing a novel per- 
spective on the sources. In Chapter 16, Reima Valimaki, Aleksi Vesanto, Anni 
Hella, Adam Poznanski and Filip Ginter study author attribution and apply 
methods based on neural networks to explore their medieval cases of author- 
ship recognition. Their intriguing results show how the uses of ‘black-boxed’ 
computational methods can potentially help us to solve centuries-long debates 
on the attribution of authorship. In the final case study in Chapter 17, Risto 
Turunen uses advanced collocation analysis to study Finnish labour newspa- 
pers during the late 19th and the early 20th centuries. With that material, he 
takes a macroscopic approach to study expressed temporality of the papers and 
especially the 'sun of socialism, which differed from the biblical sun shining on 
all and in this ‘highlighted earthly problems: Towards the end of his chapter, 
Turunen turns his discussion to the present situation and to future aims of 
digital history. 

Jo Guldi concludes the volume in Chapter 18 by drawing a wide picture of 
the potential game-changing nature of digital history. She stresses the universal 
character and widely applicable nature of digital research methods: researchers 
of Chinese industrialisation can find a method used by a medievalist also use- 
ful to their research and vice versa. She also predicts that with the increasing 
number of digitised sources and utilisation of digital methods, we may see a 
rise of longue durée in history, which as she puts it could provide new findings 
that "border on the breathtaking: 
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New Historical Challenges and Criticisms 


Digitisation and computer-assisted research tools open new possibilities, but 
also bring novel challenges and criticisms to the history discipline. There is a 
need for a wider methodological discussion on how digital research methods 
could and should be used in history research. To be able to take part in inter- 
disciplinary collaboration, it is important for historians to have a discussion 
on what digital methods mean, and where they can lead us. The ambition is 
that the studies in this book will contribute to foster this discussion. Among 
the critical components of digital history research that are addressed in the fol- 
lowing chapters are digitisation of sources, creating metadata for digital source 
materials, human-computer interaction and digital research methods. These 
are only a few of the critical issues troubling current digital history. 

One of the most pressing questions in digital history research is access to and 
problems of digitised sources. Although also important to scholars in other dis- 
ciplines, they are fundamental to historians. The availability of consistent digit- 
ised collections with long time series is one ofthe critical prerequisites of digital 
research. Simultaneously, the existing digitised sources invoke discussions of 
their availability and usability, and what overall should be digitised. Further- 
more, digitisation also changes the object of research, as a digitised newspaper 
is not the same as the physical object of a newspaper. When digitising sources, 
we, as Mikko Tolonen and Leo Lahti have pointed out, also lose important ele- 
ments of the physical objects.? The consensus of the scholars contributing to 
this book is that the readily available digitised sources should be used with the 
same or even higher level of source criticism than before. While the existence 
of easily accessible digitised sources is a crucial requirement of digital history 
research, non-problematised use of data—a kind of ‘source myopia—has the 
potential to skew the historiography towards the most readily available data- 
bases and source material, rather than the most important or representative, 
and thus possibly motivate researchers to study them instead of the sources 
that, digitised or not, would provide the best answers to the research question 
(Chapter 3, this volume). For example, the very popular usage of newspapers 
as sources, especially for historical studies before the 20th century, is not neces- 
sarily because they are the most relevant historical sources, but is rather due to 
the simple fact that newspaper collections have in many countries been exten- 
sively digitised. 

In digitising historical sources, the digitiser faces several practical choices 
that have extensive effects on historical research. The first major question is 
what to digitise. In making such basic selections, there is a threat of repeat- 
ing and amplifying the biases of the past knowledge constructions, leaving less 
prominent and marginalised topics aside. The sources chosen to be digitised, the 
ways in which they are digitised and shared, have far-reaching consequences. 
Memory organisations, such as archives and libraries, often begin their digit- 
ising efforts from sources that are most often used by the general public and 
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researchers, and are thus considered to be more important. This common prac- 
tice creates a threat to further marginalise less prominent topics and to exclude 
less studied materials. Therefore, alongside the use of digitised sources in read- 
ily available collections, the ability for historians to digitise their own sources is 
becoming an increasingly important skill. Learning how to digitise, and setting 
the best practices for digitising, data life cycles, and sharing digitised sources 
among historians are emerging important additions to the historians toolbox. 
To increase the variety of the available digitised sources, it is valuable that histo- 
rians learn to digitise sources on their own, and whenever possible, share their 
new data. The authors of this volume use both readily available databases and 
sources that they themselves have digitised. Digitisation is time consuming, 
and therefore the sharing of data is an important means of widening the base 
of digitised sources. 

In addition to digitised sources, a key issue for digital history research is 
metadata, the data that describes and gives information about the digital data 
(sources), and especially concerning its varying quality. As Kimmo Elo points 
out in Chapter 6, ‘more attention should be paid and more resources should be 
invested in metadata creation. From this perspective, the real problem is not 
the structure of a data system itself (its ‘ontology’), but rather the process of cre- 
ating source material's metadata. The principles of adding metadata to the doc- 
uments are often rather unsystematic and not transparent, and only too often 
the usefulness of (meta)data depends on the person creating and inserting the 
metadata. For example, at the workshop described above, one research team 
planned to work on metadata of images from a public source database (www. 
finna.fi). After some trials with that material, they ended up terminating their 
project because of the overly scattered and random character of the metadata 
collection. The large amount of processing necessary to enable digital methods 
to be applied would not have made it possible to finalise their project within 
a reasonable timeframe. However, this attempted project was not in vain, as it 
partly inspired one of its participants (Elo) to write a chapter on metadata and 
digital history for this volume. 

The new kind of source material for historians in the form of digital data 
and metadata makes it important for digital historians to develop a new 
digital source criticism. Compared to the pre-digital era with large amounts 
of data in non-digital forms, the contributions in this volume demonstrate 
how digitisation instead of selective sampling allows historians to use all the 
available data in their analysis, and thus more systematic analysis. Interestingly, 
distant reading of large datasets often exposes the used databases borders 
and restrictions better than traditional sampling for close reading. For exam- 
ple, the analysis of Kannisto and Kauppinen revealed the biases and partiality 
of the studied dataset. Using digitised sources demands deep understanding of 
what the data consists of because, as Eloranta, Nevalainen and Ojala point out 
in Chapter 3, straightforward and non-problematising data usage may lead 
to missing the key issues of the data and misleading interpretations of the 
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historical processes. In big data lie opportunities also for significant misinter- 
pretations and falsifications. 

As the contributions in this volume demonstrate, undertaking digital history 
research is often more time consuming and demands perhaps more conscious 
methodological choices than the traditional history approach. When one is 
undertaking digital history research, it becomes evident that alongside the 
algorithms used, the selection, creation, cleaning and filtering of the data heav- 
ily influence the results of the computer-assisted analysis. As Johan Jarlbrink 
shows, the digital research process at many stages demands manual work to be 
done, such as data cleaning. He demonstrates that this work is not only a neces- 
sary precondition for the analysis, but is actually in itself an important part of 
the analysis, as the researcher gets to work on and read through the material 
several times, and in this way learns to know the data in depth. While the quan- 
titative digital analysis makes the conclusions more convincing, the in-depth 
knowledge of the data provides crucial qualitative understandings that guide 
the interpretations of the quantitative analysis. 

Connected to this new source criticism, there is also a need to develop what 
has been described as a digital resource criticism (Chapter 4, this volume). This 
refers to the need, in order not to draw false conclusions, to be better aware 
of the internal technological logics of the digital resources used by historians, 
such as that of a database or a search engine. Similar questions of an awareness 
of the opportunities and limitations of the available resources and methods 
have lately been raised in reference to representation and visualisation of his- 
torical data. One example of this is how Maiju Kannisto and Pekka Kauppinen 
in their study (see Chapter 10, this volume) found out that the frequencies of 
the search terms in the metadata did not reflect the actual frequencies of the 
audio-visual material to which the metadata referred, but that they were more 
an artifact of the processes of how the metadata had been produced. Both Elo 
(Chapter 6) and Kannisto and Kauppinen (Chapter 10) suggest in this book 
that archivists and historians should collaborate more and in this openly dis- 
cuss the principles and practices of metadata formation, and how they could 
best serve all the parties. 

Furthermore, the chapters of this book point out the methodological zig- 
zag between distant and close reading of data, the repetitious adjustment of 
the algorithmic parameters, the evaluation of the means of the data formation, 
its broader context and preceding research, all involved in an overall research 
process of trial and error. Sippola, Kurvinen, and Hakkarainen and Iftikhar all 
show how the choices of the researcher influence the outcomes of the research. 
For example, when using topic modelling, the testing of the results with varying 
numbers of topics is a very important step in the process of analysis. Simultane- 
ously, the scholar’s understanding of the context is essential in identifying the 
meaningful results, and to be able to differentiate them from the potential non- 
sense produced by the computer, to discern the historical signal from the data 
noise. Usage of digital research methods amplifies the research findings, but 
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they also amplify the potential of false results. Computers and algorithms are 
important helpers, but they cannot operate on their own: they always require 
human guidance. 

Despite all these challenges, the contributions to this volume demonstrate 
how computational analysis can disclose new and previously unnoticed pat- 
terns in history. For example, in Chapter 12, Ihalainen summarises the benefits 
of computer-assisted analysis for his study on conceptual history by stating that 
it revealed associations between the studied concepts, which made it possible 
to estimate trends in political attitudes and revealed particular and peculiar 
political issues that would have been very difficult to find with traditional 
methods. Along the same lines, Kurvinen states, in Chapter 9, that combin- 
ing digital analysis and close reading allowed her to identify topics that might 
have remained unnoticed otherwise and exposed new ways of perceiving the 
material, ways that could prompt novel and previously unresearched questions. 

The new digital history might also foster a wider rethinking of the parameters 
of historians' professional practice. Digital research methods create new and 
at times more stringent demands on accuracy, methodological thinking, self- 
organisation and collaboration than traditional historical research. As Kurvinen 
points out, digital environments could encourage historians to conduct their 
research in ‘a more self-aware manner when every step of the process needs 
more thought than a traditional day with paper archives. Similarly, Eloranta, 
Nevalainen and Ojala point out in Chapter 3 that collaborative research on 
digital data can lead to more efficient and accurate research, but it requires 
the development of a different professionalism from researchers. Jessica 
Parland-von Essen shows in Chapter 5 the importance of historians starting to 
manage their data in a more qualified manner to themselves so they become 
more like data curators and archivists, and including thinking about the pres- 
ervation and reusability of research data from a longer-term perspective. To 
support the development of such new practices in historical disciplines, there 
is a need for historians to participate in developing new joint practices that 
support FAIR data and thus better research. This calls for collaboration among 
historians and memory organisation specialists, and for historians to reach out- 
side of history to seek out ideas from other disciplines facing similar challenges. 

Most of the chapters in this volume were written collaboratively. Along the 
process of our project, it was confirmed that digital history research demands 
interdisciplinary collaboration, since it is rare that a historian manages to com- 
bine in him- or herself both the skills of the historian and of the programmer. 
That said, it is not necessary for the historian to become a programmer. What 
is needed is the ability to collaborate and work together in an interdisciplinary 
manner with collaborators who bring expertise from the domains of computer 
and information science.'* The above-mentioned workshop proved that fruit- 
ful collaboration with IT professionals is not only needed, but also feasible and 
beneficial. And this book proves that it can bring new knowledge, as well as 
conceptual developments, to the field of history. 
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One basic challenge, nevertheless, is that although multidisciplinarity is 
much-needed in the realm of digital humanities research, it is well known 
that not all computer-related questions or tasks carried out in digital history 
research are challenging enough to peak the interests of computer scientists. 
For example, the application of a ready code to a dataset is for a traditionally 
trained historian often too challenging a task, but rather trivial for a computer 
scientist. Thus, there is an increasing importance for universities and research 
institutions to be able to provide more mundane and routine technical sup- 
port to historical researchers through their libraries, IT support facilities or 
other means, much like before the widespread availability of easily accessible 
online databases and online sources such institutional structures were central 
in assisting historians in finding research literature and source materials. 


Conclusion 


It seems evident that history research has been and will continue to be 
increasingly influenced by society's overall digitalisation. Still, the historians 
in general would benefit from being more aware than before of the interaction 
between historical research and the digitising world around them in order to 
stay both critical and constructive towards the changes and continuities of 
today. This includes taking advantage of the latest tools, as well as exploring 
their limitations to be able to keep our methods up to date and to gain a better 
understanding of the possibilities and pitfalls of historical research in the digi- 
tal era. 

As always, the future holds both promises and threats for historians, digi- 
tal and otherwise. Although it is essentially an older condition, the skills and 
resources needed for digital history research could broaden the gap between 
history departments that are better positioned and those that are not, and 
consequently create more divisions among historians. One key issue for the 
digital historians is how to succeed to excel in using and developing new meth- 
ods, while simultaneously avoiding overlooking the values of more traditional 
research. Doing and succeeding with the new explorations, while also respect- 
ing the older known and tried ways, has often shown to be the best working 
path towards the future. 

In a similar vein as the encouragement by Jo Guldi and others in this book, 
one lesson from sociologists and historians of technology has been that users 
matter, that they, rather than being passive adopters of new technology delivered 
in black boxes, can have their say in influencing the direction of technological 
change, and at times even open up and reconstruct their tools so they better 
fit their particular needs and desires." Historians as a group can and should 
be active in making choices and guiding their discipline towards an ever-more 
digital world of tomorrow, a tomorrow that soon will be a past and needs its 
born-digital history researched. 
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After almost 50 years, perhaps we have finally arrived at Emmanuel Le 
Roy Laduries ‘tomorrow. Or maybe we are already far beyond that—not 
least as most of the authors in this collection would not identify themselves as 
doing the guantitative kind of history Le Roy Ladurie expected future histori- 
ans to be doing. As historians, we can recognise how difficult it is for history's 
actors to foresee future developments, and that while Le Roy Ladurie correctly 
predicted that historians needed to learn to harness computer technology 
for their work, neither he nor his colleagues could hardly have imagined the 
possibilities of the information technology at historians’ disposal in the early 
2020s. However, in the sense that historians should learn how to make the most 
of the ‘computer’, we feel that the historians in this book with their new digital 
and distant histories have tried to live up to his hopes by going towards and 
away from his tomorrow to reach our today and its past, present and future 
digital histories. 


Notes 


! Le Roy Ladurie 1979: 6. Rabb wrote: ‘In 1967, the basic posture of quan- 
titative historians was a mixture of brashness and defensiveness. Le Roy 
Ladurie was sufficiently impressed by the discussions at Ann Arbor to pre- 
dict that "the historian will be a programmer or he will be nothing" (Rabb 
1983: 591). 

? William G. Thomas III quoted in Cohen et al. 2008: 454. 

? See Jones 2014. 

* For some of the major books published within the new digital history, see: 
Staley 2002; Cohen & Rosenzweig 2006; Galgano et al. 2008; Schmale 2010; 
Gantert 2011; Genet & Zorzi 2011; Haber 2011; Rosenzweig 2011; Clavert & 
Noiret 2013; Dougherty & Nawrotzki 2013; Jockers 2013; Weller 2013; Gra- 
ham, Milligan & Weingart 2015; Bozic et al. 2016; Koller 2016; Brügger 2018. 

? Jockers 2013; Blevins 2014. See also Guldi & Armitage 2014. 

$ As we well know, historical ‘firsts’ are often contested and contextual. 

7 The field of spatial history evolved from within Historical Geographic 
Information Systems research starting in the 2000s. See Gregory & Geddes 
2014: x, xii, xiv-xv. 

® Sagner Buurma & Heffernan 2018. 

? Jockers 2013: 3; Vanhoutte 2013: 127-128. 

? Swierenga 1970: 5. 

1 Although these collections also have much longer histories. See Lebert 
2008. 

? Moretti 2000, 2005, 2013. See also Underwood 2017. 

'3 Tolonen & Lahti 2018. 

14 See, for instance, Jarlbrink & Snickars 2017. 

15 Foka, Westin & Chapman 2018. 

'© See also Fickers & van der Heijden 2020. 

7 See Oudshoorn & Pinch 2003. 
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PART II 


Making Sense of 
Digital History 


CHAPTER 2 


The Long Road to ‘Digital History’ 


History of Computer- Assisted Research of the Past 
in Finland since the 1960s 


Petri Paju 


Kranzberg’s First Law reads as follows: Technology is neither good nor 
bad; nor is it neutral.’ 


Historians have rarely been associated with the latest IT, or the other way 
around. In broad terms, the same applies to all IT, both old and new, and his- 
tory research; they seem a world apart, unless one counts things such as pens 
and books. In their publications, most historians make it look like their use 
of information technologies is unbiased and unproblematic. However, Melvin 
Kranzberg, who was a veteran historian of technology, reminded us that tech- 
nologies always come with consequences. With digital history, and the growing 
use of computational methods in historical research, this practice and perfor- 
mance of neutrality vis-a-vis technological tools, as well as the old stereotype, 
could be changing. 

In reality, IT such as computers has been utilised in history research since the 
1960s, as in most other walks of life. At that time, a few historians in the United 
States (and elsewhere) started to explore the usability of mainframe comput- 
ers for their work.” In over 50 years, computer-assisted history research has 
evolved, or graduated, from the tests of a very few scholars into an emerging 
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field of computational history, also called more broadly digital history research. 
Of course, one should inquire if those are phases and part of the same con- 
tinuum or rather separate developments with no tangible influence from the 
former to the latter. In any case, this development seems to be something else 
than a straightforward progression. 

This chapter focuses on the history of computer use by historians, drawing 
its evidence mostly from Finland, but with an emphasis on the researchers’ 
transnational influences. To explore this evolution, this chapter asks: What 
have historians been doing professionally with computer technology, and when 
did that begin in Finland? What were their international influences in develop- 
ing the use of computers in history research? 

Here, ‘computer technology’ refers to the technological developments 
connected to computers and IT during the research period: in this case, its 
evolution from the relatively large mainframe computers to microcomputers, 
to internet and beyond. The focus in this chapter is on historical research, thus 
mostly excluding teaching history with the help of or via IT, as well as techno- 
logical changes related to publishing. 

Interviews and memoirs, various written documents, especially digitised 
history journals, and observations (since the late 1990s) are used in answer- 
ing these questions? With these materials, the chapter aims to examine this 
development from several different levels and viewpoints. These range from 
the individual scholar(s) to their collaboration and extend into libraries and 
archives, and institutional use and support of digital means to advance research 
in the field of history. 

One important motivation behind these questions is to distance the researcher 
and readers from the present terminology concerning digital humanities and 
digital or computational history, which often seem to make studying their own 
development very confusing. Without these concepts, I hypothesise, we can bet- 
ter approach and understand historical events and trends on their own terms. 

While research in historiography had tended to value and focus on the theo- 
retical aspects of historical thinking and research, this chapter highlights the 
more practical side of carrying out historical research and thus contributes to 
a more balanced idea of how historians conduct their work. A better, increased 
understanding of the now mundane technologies and practices of historians 
is especially appropriate now that the discipline is facing yet another change 
towards an increasingly electronic and more digitised research process, with 
new and more powerful computational tools, which present challenges to his- 
torians themselves, but also to teaching and outreach to the public.* 

Further, for the international discussion, this chapter serves as a reminder of 
and correction to the US-centric or Anglo-American view of history of com- 
puting-assisted history. This too was an international and transnational devel- 
opment.? In international comparison, the number of Finnish historians was 
fairly limited. After rapid growth in the 1960s, there were, in 1970, historical 
research units in six Finnish universities, employing a total of 32 professors. 
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Since then, the community grown to the extent that, in 2015, there were 56 
history professorships in eight research units, but the profession has expanded 
greatly, especially when one counts all historians with doctoral education.* Nev- 
ertheless, from early on, this community of historians in Finland took part in 
most if not all transnational trends and developments in their field and adopted 
major new technologies used by historians in industrialised countries. In gen- 
eral, then, Finnish historians’ experience of using computers can be thought of 
as rather representative of other Western countries. The few untypical aspects 
will be highlighted. 


Computer Usage Starts in the Late 1960s 


According to the digitised version of the Historiallinen Aikakauskirja 
(Historical Journal) in Finland, the word ‘computer’ (tietokone) was first men- 
tioned on its pages in a book review in 1964." One early Finnish historian to 
make use of computers, Pertti Huttunen, later wrote that he became interested 
in using computers during that same year, in 1964, while extending his stud- 
ies and planning his doctoral dissertation in Rome, Italy. There, he first talked 
about such an option with a Finnish physicist and also visited a local comput- 
ing centre.? 

Following examples abroad, a small number ofhistorians had started to famil- 
iarise themselves with mainframe computers in the mid-1960s. The first public 
discussion about computers by historians in Finland took place in the spring 
of 1967. At that time, the Historiallinen Yhdistys ry. (Historical Association), 
or younger generation of historians, had invited historians Kaarlo Wirilander 
and Pertti Huttunen, a well-known senior researcher and a doctoral candidate 
respectively, to talk about “The historian and the computer’ At the meeting, an 
IT specialist from the Helsinki University's computing centre, Jorma Torppa, 
offered technical expertise.’ 

Before this seminar in Helsinki, historian Viljo Rasila had joined the first 
short, introductory course given by the new computing centre at Tampere Uni- 
versity. The centre had installed its first computer in 1966. The following year, 
Rasila became the first historian in Finland to publish an article about using 
computers in the national Historiallinen Aikakauskirja. In it, he mentioned the 
work of Wirilander, Huttunen, the ‘brick group’ studying Roman brick stamps 
and his own as examples of history research involving computers in Finland. 
According to Rasila, this computer use by historians was just beginning.” 

This use so far included collecting and inserting data into (punched) cards, 
which were meant for building databases (to create tables and to compile sta- 
tistics) and performing calculations. Rasila himself was applying multivariable 
analysis, and specifically factor analysis, to weigh up the various reasons for 
the civil war in Finland. That same year (1967), Pertti Huttunen published an 
article outlining his ideas about how to use computers to study Roman social 
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history. His article was published as the first volume in the series Studia Histor- 
ica from the young University of Oulu (founded in 1958) in northern Finland." 

The following year, Viljo Rasila was the first to publish a history book, a 
monograph where he applied computer-aided statistical methods to explore 
key themes in recent Finnish social history during the 1918 war. His main 
computational method, factor analysis, had been developed in the field of 
psychology. The book, Kansalaissodan sosiaalinen tausta (Social background of 
the civil war), appeared in 1968. 

Heikki Waris, a professor and social historian at the University of Helsinki, 
reviewed the study for the Historiallinen Aikakauskirja and thanked Rasila 
especially for introducing new methods for historians to use.? In the same 
issue of the journal, however, Pertti Járvinen from the computing centre at 
Tampere University discussed Rasilas book and heavily criticised his choice 
of a statistical method. In his book's preface, Rasila acknowledged the com- 
puting centre and its ‘mathematicians’ who had helped him, but, importantly, 
Pertti Järvinen had not been involved in Rasila' project. Instead, Järvinen had 
taken an independent interest in this innovative approach to history and likely 
became the first computing professional to share his ideas in this journal.? All 
in all, Rasila's study accompanied many firsts simultaneously. 

Issues of multidisciplinary soon impacted Pertti Huttunen. Based on an 
analysis by a colleague, it seems Huttunen’s dissertation manuscript on Roman 
social history faced harsh criticism from a classical philologist in Helsinki, 
which led Huttunen to move his dissertation project to the University of 
Oulu.* For sure, such difficulties and change did not support finishing the 
study, but, importantly, they were not directly associated with the new, com- 
puterised method applied by Huttunen. He never returned to work in Helsinki, 
but forged a career in researching and lecturing (for instance, about the history 
of technology) in Oulu and in other universities. 

Pertti Huttunen defended his doctoral dissertation and book The social strata 
in the imperial city of Rome in 1974. Arguably, Huttunen wrote the first Finnish 
doctoral dissertation in history to use computerised methods, although that 
same year (1974), Reino Kero also defended his doctoral dissertation of general 
history at the University of Turku, and he too had used a computerised method 
in his study on migration." 

Regarding the feedback surrounding his 1968 book, Viljo Rasila recalled in 
my interview with him that the method was widely noticed, but at that point 
in time it raised mostly confusion: 


The reception of the mathematical analysis was controversial. 
Researchers of economic and social history, Eino Jutikkala among them, 
welcomed it as opening new opportunities, but the school of historians 
following [Professor Pentti] Renvall and doing textual analysis ( renval- 
lilainen tekstianalyysiin nojaava koulukunta”) shunned it and doubted 
its usefulness.’ 
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This ambiguity is relatively easy to understand when one considers the techno- 
logical and data-processing options available at the time. Starting from main- 
frame computers and the programs available on them, computer technology 
for a long time worked mainly for quantitative research and did not really fit 
qualitative research designs. First and foremost, there was virtually no data to 
be processed in digital text formats. At this time, computers and the promise 
they represented undoubtedly encouraged historians (as well as social scien- 
tists before them) to carry out quantitative research, which grew more popular 
in universities during the 1970s. In certain history departments, this period left 
a relatively strong tradition of quantitative history research that has been more 
or less carried on ever since. 

Nevertheless, it is important to note that historians had applied quantita- 
tive and computational methods in their research even before computers were 
available. In Finland, the breakthrough of these approaches occurred in the 
early 1960s, if not somewhat earlier." In an interesting simultaneity to histori- 
ans first learning about the use of computers, the first independent department 
of economic history in Finland, at the University of Helsinki, was established 
in 1966. Unlike the “old” Finnish economic history’ which was later seen as 
rather descriptive, the new economic history became characterised by ‘system- 
atic application of quantitative methods.'s From this perspective, embracing 
computers was not a beginning nor a revolution, but part of an evolutionary 
development in the scholarship of history. It was a step further, which later 
perhaps seems to us a bigger change than it actually was. However, this longer 
intellectual background of quantitative history, going back at least until the last 
decades of the 19th century, has been studied elsewhere.'? 

How did historians compare with social scientists in computer use? For 
instance, Kullervo Rainio, later Professor of Social Psychology at the University 
of Helsinki, visited Finland's first operational computer, an IBM 650, at a state- 
owned bank soon after the machines inauguration in 1958. At that time, he 
took part in a visit arranged for the Suomen Psykologinen Seura (Psychological 
Association of Finland), and in 1960 he could learn using another computer 
in Helsinki with his complex mathematical calculations needed for simulating 
group behaviour in a computer program.” 

In general, we can safely say that social science researchers started using 
computers well before historians. In Tampere University, which until 1966 
carried the name Yhteiskunnallinen korkeakoulu (College of Social Sciences), 
Viljo Rasila had for years been in the company of mostly social scientists and 
had become familiar with their statistical methods. This environment partly 
explains his early interest in and initiative to test and use a computer for schol- 
arly work in history. 

One could also surmise that Rasila was in a position to fully cooperate with 
social scientists at Tampere University, but that was not the case. When I inter- 
viewed him, he told me that there was a major political difference between 
himself (he was more conservative) and his colleagues who, for instance, in the 
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department of sociology, were politically quite left-wing. Despite the shared 
interest in using computers, this political dissimilarity caused them to maintain 
a working distance from each other.” 

In this respect, Rasila was rather typical. For a long while in the 1970s too, I 
suspect, this was a more general pattern: when compared with social science 
departments, history departments were much more conservative, including 
politically. This points out, intriguingly, that many contextual, historical factors 
could have an effect on and limit the circulation and exchange of scientific and 
scholarly tools such as the use of computer programs. 

Tellingly of this technological milieu and the options available, it was 
predominantly a few researchers in social and general history who first started 
making use of computers. In the 1960s and the 1970s, the group of active 
history researchers totalled a few hundred, so they all knew each other and 
knew what others were doing,” even if those using computers remained a 
tiny minority. Further, Viljo Rasila penned a textbook entitled Tilastolliset 
menetelmát historiantutkimuksessa (Statistical methods in history research, 
1973, 2nd edition 1977), including examples of computer-assisted operations, 
and that book became widely known among the profession, and especially 
among history students. 

In summation, during roughly the first decade of computer use by historians, 
they used IBM and other mainframe computers for statistics, saving collected 
data, evidence, storing and processing it, forming tables, and then carried out 
various kinds of calculations and statistical analysis. 


Research Projects: The 1970s 


The early 1970s saw a new phase in historians’ use of computers when the 
technology was incorporated into research projects. Such projects were con- 
sidered fashionable, and the reorganised Academy of Finland granted funds 
for up-to-date research projects in the field of history too. In 1971, for instance, 
Vilho Niitemaa, Professor of General History at the University of Turku, pre- 
sented a newly funded project focusing on people who have emigrated from 
Finland to distant countries (known as kaukosiirtolaiset in Finnish). The project 
included what Niitemaa labelled the ‘ADP department, or individuals work- 
ing on data collecting and compiling statistics with automatic data-processing 
tools. To store data, they used punched cards. The first doctoral dissertation to 
emerge from this project was written by Reino Kero, who, as mentioned above, 
defended his thesis in 1974.” 

Conducting research in organised projects had become more common in the 
sciences in postwar decades. In the leading history journal, Historiallinen Aika- 
kauskirja, several Finnish researchers wrote about current historical research 
projects in Sweden from the late 1960s onwards, and these reports included a 
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few mentions of ADP systems which were either being tested or were already 
in use to store and handle information.” 

Thus, historians continued to use computers for organising data and for 
statistical purposes in the 1970s, but, for them, making use of the ‘computer’ 
(as technology) had also become a tool for winning research funding. Using 
computers signalled taking part in advancing research with the latest ideas and 
technology, and being at the forefront of development. 

Viljo Rasila’s expertise in computers played a major role in encouraging a 
collaborative research project called Muuttoliikeprojekti (Migration Project), 
which focused on migration within Finland between 1850 and 1910, with a 
particular focus on industrialisation. That project was led by Professor of 
Finnish History, Pentti Virrankoski, from 1977. Virrankoski also directed one 
sub-project at the University of Turku while Rasila, now an appointed profes- 
sor, led another research team at Tampere University, and Yrjó Kaukiainen 
a third team at the University of Helsinki. In this project, the workload for 
collecting data manually grew much larger than was anticipated. Still, the diffi- 
culties with the ADP programs and processing the data proved to be even more 
significant. Because of these surprises, the larger project ran out of funding 
in the early 1980s. Most of the human-collected and manually input data was 
never computerised.?* 

However, the sub-project team at Tampere constructed their database 
differently from that of the Turku team, and consequently the Tampere team and 
Rasila himself were able to use and process their materials with a computer, 
and publish research results. Importantly, the larger project had formed ties 
with the Swedish project already building a demographic database in the late 
1970s, and they exchanged experiences in international seminars." Surpris- 
ingly, there are hopes that this Tampere database could be used anew in the 
early 2020s, once again inspired by the Swedish example.” 

In principle, such databases can have a very long lifespan. Nevertheless, the 
opposite seems to have been the rule, so that many Finnish projects collecting 
and processing data in history research have produced a very ephemeral legacy. 
Their datasets were left in archives with data formats that basically died out 
within a rather short period of time. 

The international discussion concerning historians' use of computers was 
increasing from the late 1960s onwards. In that exchange, Finnish historians 
rarely contributed publications, although Viljo Rasila, at least, published two 
articles in international journals such as in the 1970 volume of Economy and 
History. Importantly, however, during the 1970s and continuing well into the 
1980s, Finnish scholars had relatively dynamic transnational communications, 
especially with their Estonian colleagues from the Soviet Union who had pio- 
neered using computers in history research. Juhan Kahk was one of several 
such Estonian colleagues who published studies (using both the Finnish and 
the English languages) also in Finnish history series.” 
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Microcomputers for Text Processing: 
The New Typewriter (Plus) 


The impact of the computer on historians' practice was not only as a calcu- 
lator, but even more so as a word processor. Typewriters were already being 
advertised for historians in Historiallinen Aikakauskirja in 1916. It took time, 
however, before they began to be widely used by historians. And relatively soon 
afterwards, the latest products of the IT industry emerged: smaller computers 
that could be used as an advanced typewriter. The spread of personal comput- 
ers (PCs) or microcomputers opened up new possibilities for historians in the 
early 1980s.*° 

In Finland, Jussi T. Lappalainen was the first person to write to historians 
about the possibility of using a computer to write texts. He had heard of such 
a novelty from his son Vesa, who studied mathematics. Lappalainen explained 
that he first thought of writing archival notes on a computer in place ofusing the 
long-used edge-notched cards (or edge-punched cards, neulakortti). Father and 
son then co-wrote a short article entitled ‘Historical research without papers; 
which was published in Historiallinen Aikakauskirja in 1983. At the time, Jussi 
Lappalainen, who had previously worked at the University of Jyvaskyla, was as 
Associate Professor of Finnish History at the University of Turku.*! When the 
first, still quite expensive, microcomputer landed in the history department's 
office in Turku, his colleagues were afraid of using it. Lappalainen, however, 
was convinced about the devices potential and wrote another article entitled 
"Making text on the screen, after which his colleagues began to telephone him 
to glean some clarification. As a former publishing editor, Lappalainen also 
persuaded the popular Finnish novelist Kalle Páátalo to migrate to using a 
computer for his work. The learning phase involved some text vanishing from 
the computer's memory (or from the writing software) and this made the angry 
author revert to the typewriter for a while.” Despite the new technology, then, 
the (anticipated) main use of these new machines was familiar; it was typing. 
Computers replaced typewriters, and most of the historians started using com- 
puters as not-yet-so-advanced typewriters. Yet, social science historians soon 
discovered ways in which the PC could do more. 

In 1985, a new historical research project at the University of Helsinki started 
using a microcomputer to save and study materials. Project members examined 
the Finnish famine of the late 1860s (1860-luvun suuret nälkävuodet) based 
on the latest developments in social science history. In that project, they uti- 
lised either quantitative or qualitative methods (or both) on a variety of materi- 
als. For both types of method, they developed new best practices using software 
for building databases and for word processing, including one project-member, 
Kari Pitkánen, writing a concise guide book for fellow historians entitled His- 
toriantutkija ja mikrotietokone (The historian and the microcomputer, 1987) 

Many preferred to wait and see, however. Several historians have confessed 
that they themselves hesitated and postponed adopting the novel PCs in the 
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mid-1980s, but by the beginning of the 1990s, nearly all had started to at least 
write with microcomputers.** A significant factor in this transition was the 
increased user-friendliness of PCs in the form of graphical user interfaces (in 
place of the command line interface). At the same time, PCs became cheaper 
and consequently more common. Soon after, the media started to excite peo- 
ple about a new information network: the internet. Considering the changes 
recently introduced by microcomputers, it is unsurprising that for many (older) 
historians the new online world of information networks remained for most of 
the 1990s quite distant. 

Compared to older mainframe technology, microcomputers opened up a 
whole new spectrum of uses for historians to choose. Typing or text processing 
was by far the most widely adopted of these new uses and thus in many ways 
the most important one. But, in addition, on a PC one could also keep records 
and notes, and later draw maps and graphs, and take time to learn other new 
uses. Again, much of the development was gradual.* 

Meanwhile, many other people were using microcomputers too. These 
included genealogists, who launched their own journal Sukutietotekniikka 
(Computer technology for family research) in 1984, and who worked together to 
insert data in digital formats, and later digitised parish registers and made them 
available online (HisKi). In some universities, linguists developed corpus lin- 
guistics and even historical linguistics. In the early 1980s, the Helsinki Corpus 
of English Texts was initiated. This ground-breaking digital text collection was 
completed and publicly distributed in 1991. Quite a few historians became 
aware of these endeavours, but they remained distant to historical research. 

Overseas, groups of historians established for themselves organisations 
such as the Association for History and Computing (AHC), which was pro- 
posed at a conference at the University of London in 1986. The AHC was dedi- 
cated to the use of computers in historical research and in ‘promoting the use 
of computers in all types of historical study, both for teaching and research.” 
Unlike their colleagues in many other countries, Finnish historians did not 
form a national association for history and computing, and to the best of our 
knowledge, they consequently took part to a very limited extent in this inter- 
national discussion. 

With every major change, quite a few historians at first postponed adopting 
the new technology. Who were these non-users of the (new) technology? Until 
well into the 1980s, they were those historians who were relying on textual 
analysis—basically, the majority of people in most history departments. They 
could use card files to make archival notes and to store their data, and other 
such manual or mechanical tools, and they used typewriters or perhaps had the 
department's typist transcribe their writings. 

Gradually, for instance, cultural historians also switched their typewriters to 
PCs. Perhaps it took them a few more years, but it did happen, and soon, in the 
1990s, it was only the most senior historians who did not change to writing on 
a computer, but hung on to the typewriter. 
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At the same time, Finnish researchers committed to the new cultural his- 
tory avoided numbers and statistics, and in general quantitative methods. 
For instance, their colleagues in Italy and Germany more often used numbers 
and calculations to study microhistory. This avoidance can be regarded as a 
counter-reaction towards the general emphasis on quantitative methods such 
as statistical approaches in the 1970s. Instead, cultural historians studied textual 
evidence in the light of the then recent linguistic turn. Their emphasis was on 
using qualitative methods, especially ‘close reading’ of texts, as well as discuss- 
ing and exploring narratives. Over time in the late 20th century, close reading 
became a leading (often the main or even only) method for legions of historians 
and other people studying texts, so much so that the literary historian and Pro- 
fessor Franco Moretti termed his new and different computer-assisted method 
‘distant reading’ Inspired by the Annales School of historians, he coined the 
term in 2000.** It has subsequently gained popularity as a response and comple- 
ment to the dominance of close reading. 


Enter the Internet: Anticipating a Digital Revolution? 


In the early 1990s, the younger generation of historians discovered the internet, 
or networks of computers, that had been first built in the United States in the 
1960s for military purposes and only came into wider, academic use by scien- 
tists during the 1980s. Furthermore, some historians soon took part in creating 
a new, virtual dimension to the world. In Finland, they first tested Gopher- 
based internet pages (before the html language) which were in use by 1994. At 
that time, the World Wide Web, or the Web, after being created at CERN, had 
begun its successful expansion as the information medium over the internet. 

One of the early Finnish projects was the Electronic Centre for History 
Research in Finland. It first opened in late 1995. The following year, it joined 
forces with other related projects, and these were transformed into a new 
national cooperation. Named as the Agricola network, this was a joint effort 
among historians in the universities, libraries and archives, and it was officially 
launched in 1996. 

The new Agricola site brought together people working with or interested in 
history, created new avenues of communication and enabled them to discuss 
their relevant issues in a very popular email list, H-verkko, nationally. They 
aimed to inform others and share news, as well as publish online. Impor- 
tantly, one key component for the network builders consisted of educating 
historians and keeping them abreast of the internet's latest relevant develop- 
ments. This included thinking ahead and writing about the possible futures of 
history research in the digital era: an anticipated digital revolution and what 
that might entail.“ Further in connection to the Agricola network, a group of 
historians started to study IT history, especially in Finland, thus improving the 
shared understanding of living in a society in which computer technology was 
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gradually applied everywhere.” Out of the Agricola network's publishing activ- 
ities grew Ennen ja nyt (Then & Now), in existence since May 2001, which was 
the first national, refereed online journal in history. 

To summarise, historians were now using computers and their networks for 
searching and gathering information, including data about archives, and they 
sometimes even accessed the actual sources that someone had downloaded to 
their pages. This could easily be achieved transnationally, and for quite some 
time it seemed national borders were becoming less and less important. The 
burgeoning virtual world and its sites first complemented and then slowly 
began to replace former foundations of historians’ work such as library indexes, 
travel to archives and archive guides, followed by books, phone books, etc. In 
scholarly communications, electronic mail or email correspondence instead of 
postal letters proved triumphant in the ‘internet age.“ 

For the first time, historians were also becoming familiar with sources that 
were ‘born digital, such as email letters and digital art, and discussed the future 
of electronic sources. Two extreme questions surrounded whether everything 
would be saved electronically (a burden for historians) or whether the new 
electronic sources (such as early www-pages) would be deleted or otherwise 
lost within a relatively short time, leaving future historians without important 
materials from the 1990s. Thinking about it now, the latter seems closer to 
what has actually happened. Furthermore, the digital revolution that took place 
proved to be slower than expected and transformed into a digital evolution that 
eventually invaded every aspect of life during the 2000s and onwards. 

In the 50-year period examined here, the contextual changes for historians 
have been significant, ranging from the expanding universities to the evolu- 
tion of the Finnish society at large. The historical profession in Finland in the 
early 1960s consisted of perhaps fewer than 100 people active in conducting 
research. The number of history professors in Finland was 17 in 1960, and 
it grew to 32 in 1970 to approximately 46 in 2000 and to 10 more in 2015, 
while the number of research units (larger university departments) rose from 
five to eight in the same time period. However, the number of university- 
educated history researchers (PhD) and lower-level positions grew much more 
extensively, particularly from the late 1990s onwards. In addition to universi- 
ties, there were historians carrying out research elsewhere, especially in a few 
major institutions such as archives and the National Library.“ 

Starting in the 1990s, the Finland-based multinational corporation Nokia, 
selling new mobile phones, led the country's high-tech investments and 
image, and Finland became a leader in many IT developments. This probably 
encouraged also technologically open-minded historians to explore the new 
possibilities that the novelties might offer. Meanwhile, especially since 2000, 
the profession has both specialised further and internationalised heavily, and 
historians have in general perhaps become less and less knowledgeable of their 
domestic colleagues compared with experts abroad. Historians in the universi- 
ties have also confronted an ever heavier competition for (external) research 
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funding, which has contributed to their willingness to adopt new methods 
and ideas. 


Digitising Sources and Offering Them Online 


In many ways, digitisation of historical sources had its roots in microfilming 
similar materials. The state (national) archive in Finland started a project to 
microfilm documents in the late 1940s. It was the new general manager of 
the archive, Yrjó Nurmio, who led ground-breaking efforts to film impor- 
tant sources abroad, first in Sweden and West Germany, and thus made these 
archival collections that were considered relevant for Finnish historians easily 
available to researchers in Finland, on microfilm readers. Later in the 1950s and 
1960s, Finns could also microfilm Soviet materials.’ 

During a longer period of time, a large collection of historical newspapers 
was microfilmed in Finland. Foreign newspaper collections could be purchased 
for use in Finnish libraries and universities. Microfilming and their use had 
then continued for about three decades when automatic data processing (ADP) 
started to become another option to store and access primary sources. While 
the history of microfilming might sound ancient and wholly irrelevant for his- 
torical researchers in the 20205, this legacy is in fact a pertinent background to 
the digital newspaper collection. 

The National Library at Helsinki had already established the Centre for 
Microfilming and Conservation in 1990, located in the small town of Mikkeli 
in Eastern Finland. They aimed to create a comprehensive microfilm collec- 
tion of Finnish newspapers and journals. Meanwhile, the internet made its first 
breakthrough as a new and exciting channel to distribute information in digital 
formats in the early and mid-1990s. 

Digitisation of cultural heritage began in Finland after the mid-1990s, with 
the Mikkeli centre playing a central role. From the perspective of newspaper 
collections, an essential turning point was the launching of the Nordic project 
Tiden in 1998. In the Finnish case, the digital collection of newspapers is for 
the most part based on microfilms, which means that both the quality of the 
microfilm and the quality of the original newspaper have an important impact 
on the accuracy of optical character recognition (OCR), which varies from dec- 
ade to decade. After a busy few years, the National library was able to open the 
Historical Finnish newspaper archive online in 2001. 

The first collection of digitised newspapers already covered several decades 
of the 19th-century press. Historians could now carry out some of their histori- 
cal research using digitised original materials, over the internet, via their own 
computers in their own offices. 

Since its inauguration in 2001, this major online press archive has been con- 
stantly expanded and its user interface, such as search options, improved. These 
significant investments have made the National Library's DIGI Collection of 
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newspapers and periodicals published in Finland arguably the most used 
historical digital source material in 2018.” In fact this collection is so complete 
especially regarding the 19th-century newspapers that in many cases they are 
enough for answering the researcher's question/s. This has made some research- 
ers critical and asking if not the research questions where chosen so that one is 
able to limit his/her study into consulting only the digital materials, relying on 
keyword searches, and applying the rather conventional qualitative methods. 


Evolving Digital Humanities and Emerging ‘Digital History’ 


Gradually, in the 2000s and the early 2010s, an increasing number of historians 
became aware of and familiar with the massive amount of digital texts from 
primary sources that were processed by memory institutions such as libraries 
and archives around the world into digital formats and made available online. In 
retrospect, suddenly, there was an abundance of material suitable for qualitative 
and quantitative analysis online. Anyone could perform simple yet comprehen- 
sive keyword searches in these vast collections. It was (and is) easy to forget that 
such searches might be anything but perfect (due to the low quality of OCR 
results) because the accuracy of the search process was very difficult to assess. 

Most researchers rapidly realised that one could only perform ‘close reading’ 
on a tiny fraction of those online sources because even just skimming them 
all went beyond anyone's capabilities time-wise. This gradually led progressive 
historians to think about obtaining and/or creating more adequate, comput- 
er-assisted methods and the means to get the most out of this wealth of digital 
sources. Among these, one can count the above-mentioned literary historian 
Franco Moretti. 

Meanwhile, computerised methods and software with a longer development 
history such as GIS came to be used by a few historians in Finland in the 2000s. 
They used GIS to place and study historical information on maps of various 
kinds. Compared to GIS, textual analysis with computational tools and the 
newly emerging ‘big data’ was still very much being invented and developed 
during the early 2000s. Nevertheless, researchers of AI had made important 
progress in cooperation with linguistics since the 1980s, and a research field 
called natural language processing (NLP) was advancing. Based on complex 
statistical mathematics and algorithms, this work promised new tools for ana- 
lysing texts too. The first peer-reviewed journal article where the rather recent 
method of ‘topic modelling’ was applied for historical materials was published 
in 2006.9? 

In Finland, too, the early 2000s witnessed inventions in software turned into 
new digital tools that historians could use. For instance, in the late 1990s, a 
group of medievalists and the National Archives had built an electronic version 
of Finland’s medieval sources (medeltidsurkunder), producing an online data- 
base called Diplomatarium Fennicum.* In the mid-2000s, Tuomas Heikkilä 
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joined forces with some IT specialists and together they started developing 
computational methods to group medieval texts. Their aim was to create a fam- 
ily tree, a stemma, based on the dis/similarity of those early scripts, in order to 
better study their origins as well as influences on each other.” Over the years, 
this new interdisciplinary cooperation has led to several international scholarly 
meetings called Studia Stemmatologica, as well as publications developing fur- 
ther stemmatological analyses. 

The availability of these digital materials combined with the introduction of 
new tools sparked many developments during the 2010s that are changing and 
will renew history research. Starting towards the middle of the decade, sev- 
eral national conferences and seminars have been organised to discuss such 
new research. The first two textbooks concerning historical research and digital 
methods were published in Swedish and Finnish, in 2014 and 2016, respec- 
tively In 2015, the major research funder for historians, the Academy of 
Finland, opened a call for projects to The Digital Humanities Academy Pro- 
gramme (2016-2019), which encouraged many to pay more attention to devel- 
opments going on in this new research area. Some, but not quite all, of the 
outcomes of this wave of new research are presented in this book. 

All this technological development and expectations for ever faster and wider 
analysis of the historians’ ‘big data has also re-emphasised ‘old’ problems (stem- 
ming from the 1990s), such as the poor quality of OCR-processed digital texts. 
How can we overcome this obstacle to the use of these latest computational 
research methods? Challenges like this partly motivated some historians to 
plan the project Computational History and the Transformation of Public Dis- 
course in Finland, 1640-1910, funded during 2016 to 2019, in which the low 
OCR accuracy in the digitised newspapers and periodicals was circumvented 
by basically using a method originally designed for bioinformatics—in this 
case, modified to recognise the reoccurrences of similar text passages system- 
atically in several millions of pages of primary sources. 

These challenges are highlighting our need for developing novel ways of 
digital source criticism, but also for taking new, fresh perspectives on the digi- 
tal evolution that surrounds us. An eye-opening example is offered by Johan 
Jarlbrink and Pelle Snickars, who studied the specific ways in which newspa- 
pers are transformed in the digitisation process, and concluded that in fact the 
massive digitisation has created large amounts of digital noise: ‘that is millions 
of misinterpreted words generated by OCR, and millions of texts re-edited 
by the auto-segmentation tool, resulting in a new—and, moreover, unevenly 
distributed—layer being added to the shared cultural heritage." This reinter- 
pretation suggests and confirms, first, that we need to learn to live and come 
to terms with that digital noise and, second, that a totally new and so to speak 
born-digital (that is, generated by computer technology) demand for histori- 
ans’ tools in computer technology will be to reduce that digital noise. 

Meanwhile, this emerging ‘digital history’ research has also been explored. 
In one inquiry, Finnish historians raised doubts about this new concept and/ 
or identifying themselves with it. In other words, many responders expressed. 
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uncertainty about whether or not they were digital historians and/or digital 
enough, meaning that, as of 2016, few historians saw themselves as digital his- 
torians.”” Among the critical issues that were identified through the inquiry 
were the importance of creating better, up-to-date information channels of 
digital history resources and events, providing relevant education, skills and 
teaching by historians, and the need to help historians and IT specialists to 
meet and collaborate better and more systematically than before. 

One can hypothesise that two camps of historians were formed in the late 
20th century, distinguished by their use of computer technology. On the 
one hand, everybody was more or less taking advantage of text processing 
(working with text files and mainly writing), PCs in general and the internet, 
in various ways. On the other hand, there were those sub-fields that (had) also 
continued with quantitative methods, such as statistics, for a long time. But 
many historians concentrated mainly on text processing. It is important to 
note that the new methods of digital humanities, based among other things on 
developing NLP (technology), were more eagerly adopted, and embraced even, 
by those researchers who focused on processing texts. To be more precise, it 
was a fraction of those historians who embraced the latest methods and also 
appropriated the term ‘digital history; while the social and economic historians 
adhered for a longer period to their seasoned ways in quantitative methods. 

Further, these new ideas and the digital humanities scholarship have in Fin- 
land, as elsewhere, been brought together in new laboratories for humanistic 
research. By far the largest effort nationally in this field, the Helsinki Centre for 
Digital Humanities, or HELDIG, was established at the University of Helsinki 
in 2016. By 2020, HELDIG has evolved into a vibrant centre of teaching and 
research in digital humanities, including digital history. The centre’s multidis- 
ciplinary research groups, led by Eero Hyvénen and Mikko Tolonen among 
others, have concentrated on semantic web and building linked open data 
portals, such as the Sampo series, intended also as historians’ research tools, 
and on using large but overlooked collections of library metadata to quantita- 
tively examine the evolution of book publishing and the press over hundreds of 
years, respectively. In addition, a group of Finnish historians has been actively 
involved in the association Digital Humanities in the Nordic Countries and its 
DHN conference series held annually since 2016. In 2018, HELDIG was one of 
the key organisers of the third DHN conference, this time arranged in Helsinki. 
The overarching theme of the conference was Open Science, which challenges 
current and future historians in yet other ways. Historians and other schol- 
ars involved in the field of digital humanities may expect all of this to further 
advance their digital research capabilities in the future.** 


Conclusion 


To better understand where the present digital and computational history has 
come from and its place in the historical discipline, this chapter has studied 
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the historians’ use of computer technology, together with some associated 
technological influence in history research in the case of Finland. It is argued 
here that such an open and broad approach to these phenomena serves best to 
expose the complex and already quite extensive roots of the present-day digital 
history approaches. 

Certainly, historical research has many layers of history with the digital, and 
this relationship continues to be formed in the mutual shaping of the research 
field, including its people and ways of doing things, technology and the society 
at large. Perhaps we can even say that the field of digital history today has not one 
but many histories, and its history remains open to a variety of interpretations. 

On the one hand, it is difficult to exaggerate the changes that computer tech- 
nology has brought to the work of historians (too) during the recent decades. 
Combined with other changes, the technological advances have positively 
enabled and enlarged historians’ study options in unforeseen ways and scale, 
while they have also guided and reformed the research designs (see Table 2.1). 
On the other hand, it has been a long and circuitous route from computers 
being used for processing statistical data in the late 1960s (Viljo Rasila) and 
thereafter being used mostly by historians undertaking quantitative research, 
up until several technological advances and also disruptions (microcomputers, 
the internet and the World Wide Web, and related software), to the present day, 
where historians are able to perform their whole research process digitally, from 
planning to gathering materials, carrying out the analysis, including statistics 
(if any), writing their interpretation and then publishing the results online. 

Nevertheless, it is evident that the use of IT was heavier in some sub-fields 
than in others, for many reasons. Those reasons range from theoretical under- 
pinnings to copyright law, which has slowed both digitising and distributing 
certain primary sources from the 20th century. 

From early on, divisions were created by different approaches to under- 
standing history and consequently how the research was done. For a long time, 
starting from mainframe computers and the programs available on these, com- 
puter technology worked better for quantitative than qualitative research. That, 
in turn, might be one reason why the new ‘digital history’ was, albeit decades 
later, more eagerly welcomed by (some of the) historians analysing texts. This 
type of source had been the focus of their qualitative work for decades, and 
by the 2010s they needed new tools to handle the massive amounts of textual 
sources that organisations such as major libraries around the world had digi- 
tised and made available online during the last 15 to 20 years. 

What remained the same during the 50 years in between was that the inter- 
pretations were made by the human mind of the historian. Unless perhaps 
those interpretations also changed while the technological environment and 
tools for making them were transformed? This is quite conceivable, which 
reminds us that we still know very little about the impact that computerisation 
has had on history as a field of study and its products from historical narratives 
to its theories of change and continuity. It is also time for the students of histo- 
riography and even philosophers of history to take a serious, deep look into the 
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Table 2.1: Milestones of computer use by Finnish historians. 


1967 Two articles on using computers for scholarship: Viljo Rasila in 
Historiallinen Aikakauskirja; Pertti Huttunen on Roman social history. 

1968 First monograph to use computer-aided statistical methods: Viljo 
Rasila, Kansalaissodan sosiaalinen tausta (Social background of the 
civil war). 

1974 First two history PhDs using computerised methods: Pertti Huttunen 
and Reino Kero. 

1970s Computers in research projects: focused particularly on migration and 
mobility. 

1983 First article (Lappalainen and Lappalainen) about PCs for 
historians’ use. 

1990 Centre for Microfilming and Conservation established in Mikkeli. 

1996 The Electronic Centre for History Research in Finland (SHEK) for 
internet use and digitising sources begins (in the Mikkeli centre 
and elsewhere). 

2001 Historical newspapers opened for research online and Ennen ja Nyt 
journal established online. 

2014-2016 | First two textbooks about digital history published in Finland. 


Source: Author. 


practical aspects of ‘doing history,” where computer technology has become 


so central. 


Whether embracing the new tools or shunning them, we should, however, 
remember what Melvin Kranzberg (a leading historian of technology) famously 
formulated as his first law. In our case, Kranzberg’s rule, quoted as the epigraph 
to this chapter, means that we should take historians’ thoughts and feelings 
about technology seriously. At times, they probably saw the computer technol- 
ogy as good, bad or both. More importantly, it reminds us that the computer has 
never been ‘just a tool, and this is why we should collectively think more about 
using these changing products of IT developers and their bearing on our work. 
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CHAPTER 3 


Towards Big Data 
Digitising Economic and Business History 


Jari Eloranta, Pasi Nevalainen and Jari Ojala 


Prologue 


An ambitious project was initiated in 2002 and concluded by 2007 by Finnish 
economic and business historians to analyse digitised news agency data in order 
to create a model to predict the behaviour of business enterprises. This project, 
entitled MetaSignal (later MetaAlert), was a joint venture between historians, 
journalism researchers, engineering scholars and economists working at the 
University of Jyvaskyla and the Tampere University of Technology. The aim was 
nothing less ambitious than to create an artificial intelligence (AI) that could 
learn from the past to predict the future. 

The AI was intended to compile automatically, categorise and analyse avail- 
able online information to find so-called weak signals from a massive flow of 
information. To ‘teach’ the AI, the project used a massive news agency data- 
base, including roughly 20 million business newsfeeds from the early 1970s 
to the early 2000s. For the first time in Finnish historical research, the project 
also used digitised full-text New York Times newspaper data from the 1850s 
onwards, together with databases containing information about listed compa- 
nies and stock market prices over an extended period of time. 
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Needless to say, this bold initiative failed as the project did not have sufh- 
cient human resources or computational power circa 15 years ago to reach its 
goals. Nevertheless, as for the outcomes, the project did identify publications 
and networks that were valuable at the time and at least interesting even from 
today’s perspective. One must bear in mind that the internet was still a new- 
comer at the turn of the millennium; thus, there were still many uncertainties 
as to which direction it would develop and which would be the most usable 
tools to find information on various topics. Moreover, the databases on emerg- 
ing markets of the internet were also at a developing stage, and so was the price 
of information: the price of annual use of the databases used in the project 
roughly doubled every year. 

By analysing the data available from open sources at the time and comparing 
it to the data purchased from the databases in the market, the project found, 
for example, that the very origins of the contemporary newsfeeds could be 
traced to few, well-established and old news agency firms or media companies.’ 
It was not until the emergence of the digital camera, smartphones and social 
media when the supremacy of these companies began to collapse, at least to a 
certain extent. 

The project members did not necessarily even notice at the time how 
fast the environment around them was changing. The project participants trav- 
elled to Stanford to learn about the latest trends in Silicon Valley and report 
their findings to the steering group. Therefore, it was the historians and the 
other humanists who were the first to inform the others in the meetings with 
the funding agency Tekes (Finnish Institute of Technology and Innovation, 
nowadays known as Business Finland) about interesting emerging companies 
in the United States, like Facebook? 


The MetaSignal project was just an outcome of a long tradition of compiling 
and using massive databases, distant reading methods and, most importantly, 
sophisticated methods among economic and business historians to analyse 
numerical and textual data. The use of a massive database to predict future 
trends in the MetaSignal project was not, obviously, a ground-breaking idea. 
On the contrary, computerised methods have been used in social sciences in 
this respect at least from the early 1960s, when the first attempts were made 
at the RAND laboratories.’ 

Economic and business historians have been the forerunners in the digital 
history data gathering and analysis for decades. This chapter attempts to dis- 
cuss the major developments internationally and, in some specific cases, in 
Finland in the fields of digital economic and business history, concentrating on 
some of our own projects, as well as research outcomes by economic and busi- 
ness historians at the University of Jyvaskyla and within our networks. We are 
not claiming that our projects are unique or ahead of their time in the field of 
economic and business history—on the contrary. However, we feel that these 
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projects are indeed illustrative cases (such as the aforementioned MetaSignal) 
about the possibilities and challenges facing historians in the digital era. 

After a section introducing the use of digitised data in economic and busi- 
ness history, we will briefly discuss the methodological challenges in the use 
of these methods, followed by sections concentrating on event data analysis 
and challenges involved in using various databases (with some examples). 
Thereafter, we focus our narrative on the use of digital sources and methods in 
business history. In the concluding discussion, we will address the challenges 
and opportunities offered by digitised sources, followed by some exposition of 
the remaining challenges. 


Big Data in Economic and Business History 


Big data is at the heart of economic history research, and has already been so 
for decades.‘ Big structures, large processes, huge comparisons, by Charles Tilly, 
a famous historical sociologist, was a book published in the mid-1980s that 
highlighted some of the early efforts in such scholarship. Tilly’s classic studies 
urged researchers to study the macro-level societal structures systematically, to 
better understand large processes of change.’ Tilly was also one of the forerun- 
ners of ‘social science history, pushing sociological understanding to advance 
historical research. Economic historians were also part of this process and, to 
a certain extent, the first ones to explore and exploit the possibilities of social 
scientific methods and data in historical research. 

Since the time of publication of Tilly’s book, the datasets compiled and used 
by economic historians have become larger and more varied: numeric data is 
nowadays more often ‘born digital’; and besides numbers, even economic his- 
torians are today more often using high-resolution digital images and digitised 
texts. The quantity of available data has increased dramatically, whereas the 
costs of storage have decreased—even though there is now a new challenge 
for academia arising from the costs of the best datasets and digitised library 
collections.° As Guttman and colleagues (2018: 269) note: A key characteristic 
of modern “big data” is that the volume of stored data exceeds human analytic 
capacity and pushes against the boundaries of currently-available computing 
power. For that reason, the magnitude of “big” is continually growing? 

By its principles, economic history research does not differ substantively from 
other types of historical research: economic historians compile data from origi- 
nal (archival) sources to provide answers to questions posed by scholars. What 
differs, though, is that the questions asked are often based on testable theoreti- 
cal frameworks originating from social sciences and usually require a massive 
amount of data that, in turn, cannot be analysed without sorting the data into a 
database format, as well as by using some sort of quantitative methods. However, 
economic historians were forced to compile these types of datasets themselves 
for decades, whereas today there is a large amount of readymade data available, 
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starting with various text corpuses (for example, digitised newspapers), statis- 
tical data provided by different national and international authorities (such as 
census records) and databases compiled by researchers, authorities and private 
enthusiasts in different fields, including genealogical associations. The latter 
type of ‘citizen participation or citizen science to compile data will most likely 
increase in the future, as well as different kinds of official, linked register data. 
Nevertheless, even today, researchers studying especially the ancient and early 
modern eras are forced to mainly compile the datasets by themselves, whereas 
those concentrating on the more contemporary periods and topics have to 
face the challenges associated with the already existing datasets. 

Using digitised sources is at the very core of international economic history. 
Computerised methods were embedded into the economic history research 
during the ‘Cliometric Revolution’ in the 1960s and 1970s, when the so-called 
"historical economics' tradition emerged first in the United States, then also 
later in Europe. The first researchers in this tradition were mostly trained as 
economists—such as Alfred Conrad and John Meyer, and then Robert Fogel 
and Douglass C. North—using their theories, models and econometric meth- 
ods to study and understand controversial topics in history, like the produc- 
tivity and profitability of slavery. Obviously, mainstream historians were not 
totally convinced about their studies and methods, especially as some of the 
advocates of the 'new economic history' took historians head on vis-à-vis many 
big topics." By the turn ofthe millennium, this battle had settled down, as more 
historians have adopted cliometric methods to be a part of their toolkit and as 
'social science history' has become more common. Simultaneously, economists 
are taking history research more seriously. Nevertheless, the major journals in 
economic history today are more oriented towards economics than they were 
back in the 1950s.* 

The most obvious outcomes of the 'new economic history' have been the his- 
torical growth studies in different countries, compiled together in the Mad- 
dison Project database maintained at the Groningen University? Historical 
national account series and other long-run societal and economic time series 
form a basis for all comparative macroeconomic studies of history. These 
include data on population, prices, wages, structure of the economy (size of 
agriculture, industry and services), foreign and domestic trade, urbanisation, 
central (and local) government expenditures and, finally, GDP (per capita) that 
is based on all the other data series listed above. Historical national accounts 
have made comprehensive comparisons over long periods of time more cred- 
ible between a growing number of countries. These datasets have been game 
changers in the field and have occupied a substantial role in the debates over 
long-run economic growth. Angus Maddison (2001) published his initial global 
growth figures spanning 2,000 years at the turn of the millennium, but he had 
already started putting these numbers out in various publications from the 
1980s onwards. Obviously, his early figures were rather tentative, and the GDP 
per capita estimates in general for many developing states were too low. Recent 


Towards Big Data 49 


efforts, for example, by Stephen Broadberry'" and others, have exposed some 
of the flaws in these figures and extended our knowledge of not just European 
and Western development patterns, but also economic performance in Asia and 
Africa. These figures are now changing the debate over global trade and the so- 
called Great Divergence; that is, when and how China fell behind the West in 
the last 500 years." In recent studies, the focus has shifted to account for new 
areas of interest, such as well-being and inequality." Consequently, the exist- 
ing Finnish historical national accounts from 1860 onwards were compiled by 
Riitta Hjerppe and the growth studies research group in the 1970s and 1980s, 
comprising 13 volumes in total, and they are still the benchmark in the study of 
Finnish economic history.” 

Business historians, in turn, have been more focused on actors and related 
activities in the economy, whether by private persons, entrepreneurs, busi- 
ness enterprises or other groups. These actors represent the ‘visible hand’ of 
the aggregate economic system. Research on these actors, in turn, helps us to 
understand the evolution of economic structures. By looking at the American 
19^-century railway companies, Alfred Chandler Jr. (1977) created the basic 
framework for the business strategy research. The methods used by modern 
business historians are more often qualitative, and the quantitative methods 
used are typically more descriptive than statistical ones.'* Nevertheless, big data 
and methods used to analyse digitised databases have become more impor- 
tant also for business historians. This is simply due to the fact that either the 
data produced by entrepreneurs and enterprises over time are in most cases in 
numerical form and/or the volume of data is massive." Even the early modern 
businessmen such as 13th-century Commenda traders in Genoa or late-18th- 
century Finnish businessmen produced a massive amount of letters and ledg- 
ers; some of those have lately been converted into a digitised format. The recent 
historical business data is already of digital origin. The shift to increasingly dig- 
itised material has enabled researchers to utilise larger quantities of material in 
qualitative research in future studies, including new ways to collect and analyse 
the material, including the use of AI in data mining and analysis. 


Use of Quantitative and Qualitative Methods to 
Tackle Digital Sources 


The use and analysis of quantitative data has been a hallmark of economic history 
research, especially since the turn towards more quantitative economic history, as 
we have already discussed. The aim of this more economics-influenced 
research has often been to attempt to find causal relationships between differ- 
ent phenomena; namely, to measure what were the factors explaining changes 
in phenomena proxied by various time series, cross-sectional or panel data. 
For example, during the past decades, there have been many attempts to 
compile data on, better measure and understand the dynamics of pre-industrial 
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economies; for instance, to clarify the role of women, children and families in 
the pre- and early-industrialising societies.'* Alongside the time series (or pan- 
els) of economic development, much attention has been placed on the study of 
equal or unequal distribution arising from this development." 

From the 1950s onwards, econometric tools such as regression analysis have 
emerged as a typical way of estimating the relationships between economic 
variables. Regression analysis is today a common tool both in economics and 
social sciences, and also in economic history. Thus, in order to understand 
what has been written in the field during the past decades, one has to be famil- 
iar with at least the basics of this method; or, rather, the set of regression and 
other econometric techniques for modelling and analysing several variables. 
More commonly, regression analysis estimates the conditional expectation of 
the dependent variable vis-à-vis a set of independent variables; for example, 
what was the importance of education, investments or policy indicators for the 
economic growth or, as we have done, the effect of new technologies for wages 
of different skill levels of employees.’ 

Certain aspects of regression analysis have also been criticised, such as 
the over-reliance on measures of statistical significance. Historians are par- 
ticularly worried how such methods are suited to the analysis of time series 
as the observable and unobservable factors might change over time, and also 
the sources of data are similarly subject to change. Some of the research has 
become perhaps even overly technical by nature, thus losing its relevance for 
broader historical narratives.” Finally, causal relationships are hard to pin- 
point, especially from more qualitative data," and in econometrics the very 
idea that causality could be ascertained from regression analysis has become 
quite contested. 

Another way to analyse causal relationships is by using counterfactual mod- 
elling: namely, to analyse a scenario of *what if' the phenomena had not have 
occurred or a different historical trajectory had taken place. Economic history 
also has a long tradition of counterfactual analysis, starting from the early writ- 
ings of Nobel-prize-winning economic historian Robert Fogel. Those models 
have, however, been criticised time and again by historians.” 


Event Data Analysis 


Although the methods used by economic historians could and should be 
criticised for certain shortcomings, they are nevertheless something that other 
historians might wish to emulate when using digitised, big data’ sources. These 
methods can also be used when analysing qualitative, textual datasets, by intro- 
ducing ‘binary thinking’ to the analysis; that is, coding the textual data to ena- 
ble quantitative analysis. We have used, for example, ‘event data analysis’ to 
code actions and activities found in historical data, like the 'strategic actions' of 
companies. The basis for event data analysis can be found in historical events 
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that are arranged according their sequences. The coding of events (for exam- 
ple, strategic actions) enables comparing different actors, such as companies or 
business groups.” 

While reducing texts to ones and zeroes might lead to over-simplifications, 
the use of more open methods, such as fuzzy set Qualitative Comparative 
Analysis (fs/QCA), has proved to be suitable for historical inquiries, as the set- 
theoretic relations frequently reveal more plausible causal relations than simple 
correlations.” Moreover, these types of methods can also be used to extrapolate 
larger datasets from smaller samples, in which typically statistical analysis has 
been near impossible. Often, the dichotomy between small-N qualitative case 
studies and large-N statistical studies has been overstated.” Essentially, they 
follow the same underlying logic of research. The best way to avoid the pitfalls 
of each is to engage in both or combine the strengths of each approach. These 
types of methods have been further developed by some Finnish business his- 
tory and management scholars in particular.” 

In international comparisons, comparable data, contexts and how the data 
helps make broader points about processes all play a role. For Finnish histo- 
rians, though, even the question of the relevance of comparisons might some- 
times alter the way in which we think about the sources and data. One of our 
own examples is from some years ago when we were using a large-N database 
which comprised information on Finnish and Swedish sailors. Thus, an obvious 
perspective for us was to compare these two countries in our analyses. For read- 
ers outside Scandinavia, however, this did not make much sense: the reviewers 
and editors of journals saw Sweden and Finland rather as complementary than 
interesting comparative cases in terms of our research question, and the paper 
was rejected time and again, before we fully realised this challenge and changed 
the paper accordingly." 

This type of categorisation is something we have tried to develop further 
also in our bibliometric work focusing on analysing trends in business history 
scholarships. As categories of the contents of journal articles in the ready-made 
databases (such as WoS or Scopus) are always subjective, we introduced cer- 
tain measures to make such categorisations more objective in our study. Obvi- 
ously, these are again methods used previously in other fields, but ones that 
can also be adapted to the study of economic and business history debates. 
For example, we engaged several researchers to do categorisations of previ- 
ously published business history articles simultaneously, and then either used 
‘consensus’ or average categorisations, or results of ‘voting: In the latter case, 
the 'votes' (zeros or ones by each individual doing the categorisation) for each 
category were summed up, and thereafter these sums of votes were calculated 
as a percentage of the maximum possible number of votes. These percentages 
were then taken to be the share of each category and as basis for further sta- 
tistical analysis, namely to study why certain business history articles received 
the most citations.” The next obvious step is to introduce these bibliometric 
techniques to book-format publications, which would help us gauge the trends 
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in a publication format that historians prefer, again broadening the analysis of 
interdisciplinary transference. 


Making Big Data Work: Databases and Their Challenges 


As we have shown here, economic and business historians have been engaged 
in creating their own databases for a long time by using a variety of primary 
sources.” The data collected from the original primary source material has 
typically been stored as digital images, Word and Excel files on the research- 
ers’ own computers, and perhaps distributed via email or cloud services, when 
sharing was needed, for example, to make a common writing project easier. 
That is the case even today in many instances. Regardless, currently there is a 
growing number of ready-made databases that have to a certain extent eased 
the work of economic and business historians, yet at the same time they have 
provided new types of challenges. First of all, the availability of these databases 
has motivated researchers to study topics for which the data is (easily) available, 
and to find connections between those variables for which we have informa- 
tion. To study Finnish economic and business history, it might be challenging 
to use some of the international datasets, as information on Finland might be 
lacking, or is otherwise irrelevant or even incorrect. Some of the most impor- 
tant international databases, however, do have some data for Finland as well, 
like the Maddison Project database described above; Clio-Infra (http://www 
.clio-infra.eu/), EH-net databases (http://eh.net/databases/), Global Price and 
Income History (http://gpih.ucdavis.edu/) and Swedish historical monetary 
statistics 1668-2008 (http://www.riksbank.se/research/historicalstatistics). 

The challenge is, however, that in many of these datasets the data on Finland 
is to a certain degree confusing and even misleading. This, in turn, relates to 
the fact that the data has been compiled from national statistical sources or 
from previous research. In the Finnish case, we simply still lack some of the 
basic research; thus, the datasets are using the existing figures for Finland. 
The Maddison database, for example, uses the growth figures for Finnish GDP 
(per capita), for certain benchmark years, for the last 2,000 years by using 
inter- and extrapolation methods. Nonetheless, Finnish growth studies have 
produced more exact figures so far only from the 1860s onwards. Currently, 
though, there is project at the University of Jyvaskyla to fill the gap from the 
1500s to mid-1850s in order to have more reliable, internationally compara- 
ble time series for Finland as well.” This will, hopefully, make Finland more 
appealing as a unit to be used in international comparisons: currently, Finland 
is lacking from a number of international studies simply due to the fact that 
comparable data does not exist yet. 

Some of the international databases have been especially valuable also for 
Finnish economic and business historians. Beside those noted above, two spe- 
cific datasets recently used by Finnish scholars are worth noting: the Soundtoll 
Registers Online (STRO) compilation (soundtoll.nl) and the Swedish Seamen's 
House enrolment database. 
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The STRO compilation is a good example of how digitised, large databases 
can be constructed with reasonable costs and in a limited amount of time?! 
The STRO database is based on the archival data created in the Danish Elsinore 
in the Sound Toll that was established in the late 15th century and lasted until 
1857. The STRO database includes roughly all the ships and their cargoes that 
passed the Danish Sound from 1634 to 1856, comprising 1.4 million ships. Of 
these ships, roughly 2.496, that is 35,000 ships, came from or headed towards 
Finland. In order to understand Finnish international trade and shipping, the 
STRO is especially important as the Danish Sound was the only route for Finn- 
ish export and import trade to markets beyond the Baltic for centuries. The 
Baltic trade as a whole, in turn, was of utmost importance in understanding 
the early modern and modern growth of Europe, as this trade was, as Milja van 
Tielhof puts it, ‘the mother of all trades.” The Danish Sound data used in previ- 
ous research? was mainly based on the Sound Toll Tables compilation by Nina 
Ellinger Bang and Knud Korst in the 1920s and 1930s.* Their data, though, 
covered only the period up until the early 1780s, and later Hans Christian 
Johansen extended the period up until the mid-1790s. Thus, from the Finn- 
ish perspective, the STRO is fascinating as it covers the era from the late 18th 
century until the mid-19th century, which was in many respects an emerging 
era for Finnish export trade and shipping. 

Nevertheless, although the STRO data is highly valuable for research in gen- 
eral and for Finnish history research in particular, it also entails many chal- 
lenges that can at the moment only be partly solved in the online dataset. The 
names of places and commodities are currently being made uniform, as well as 
the different units used (weights, sizes, etc.), and, moreover, there are a num- 
ber of mistakes in the dataset that might have been present already when the 
entries of the original customs data were made or later during the data-entry 
process of the database. At the moment, there is an extensive project in Leipzig 
being overseen by Dr. Werner Scheltjens to modify the data further; this ver- 
sion, STRO 2.0, will be launched in the coming years. Finnish economic histo- 
rians are also collaborating closely with this work in order to have even better 
data to use to study Finnish long-term trade patterns.” 

Another important database used by Finnish economic historians is the 
Swedish Seamen's House enrolment dataset. This database was compiled at 
the turn of the millennium by the Swedish National Archives in collaboration 
with the Swedish Genealogical Association. The database includes roughly 
650,000 enrolment cases and 26 million data points from nine Swedish coastal 
towns and one Finnish town (Kokkola). Researchers at the University of 
Jyvaskyla gained full access to the database more than 10 years ago, only to 
find out that there were many challenges with the data. Indeed, the database is 
a good, or bad, example of the challenges inherent in these types of databases. 

First, the researchers did not have full access to the data in the beginning, 
which made quantitative analysis impossible. Many similar genealogical data- 
bases have been designed to help users find detailed information on, for exam- 
ple, their ancestors— not to perform statistical analyses. Second, the data did not 
need to be exact in terms of values and figures to serve genealogical inquiries 
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and, therefore, in the datasheets sometimes numeric and textual data became 
mixed. This all meant that it took almost a decade for the researchers first to 
clean up the data, enrich it with additional information, and then standardise 
the monetary and other units (especially tonnage measures of ships) before 
it could really be used. This led to the third challenge that, again, is unfortu- 
nately rather common in many research projects using ready-made digitised 
databases. Namely, the database used in the research is to some degree dif- 
ferent from the one that is available online at the Swedish National Archives 
website, and the researchers cannot, in accordance with the signed contract, 
publish the data they are using. Thus, hopefully in the future, the Swedish 
National Archives will publish the modified dataset separately on their website; 
this would be helpful for the research community at large, as this database is 
certainly highly valuable and the results have already been published in some 
of the most notable publishing forums.” There is already an initial agreement 
between the project researchers and the archive to publish the data in one form 
or another. 


Digitising Business History 


The magnitude of ‘big’ is also continually growing in the field of business 
history. In practice, qualitative researchers can utilise much larger volumes 
and types of data than before and, on the other hand, different tools of analysis. 
The major development trend of recent decades is the diversification of the 
research field. Although the mainstream debate is still focused on businesses, 
entrepreneurs and entrepreneurship, the perspectives of research have widened 
over the last decades to cover a broad range of business-related themes. For 
example, the importance of interest groups, entrepreneurship of women and 
minorities, developing economies and environmental issues as part of business 
practices have emerged as major topics of discussion.? Even though most of 
the research is still being carried out in corporate archives, relying largely 
on textual material such as minutes and memos, it is because of the broadening 
of the scope of inquiry that the source material is quite sparse. 

Finland's strength in business history research has traditionally been a com- 
prehensive and open public archival service, which has guaranteed access 
to first-class material. One of the most important of these institutions is the 
National Archives (Kansallisarkisto), which has provided access to, among 
other things, abundant government documents, but also many archive col- 
lections of individuals and some private organisations. In Finland, the state 
has a strong position in society, and state documents, for example, contain not 
only information on legislation and administration, but also a huge variety of 
useful reports produced by various government organisations. The availability 
of sources has been supported by legislation under which a public author- 
ity document is in principle public.“ Moreover, this also covers state-owned 
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enterprises, depending on their legal form of action. Such archives also cover 
very interesting research sites that are difficult to access in many other coun- 
tries. The archives of the state-owned telecom company (PTL Tele/Sonera) are 
available to researchers until the year 1994, when the company changed its 
legal form from a public authority into a limited liability company. An even 
more important archive for Finnish business historical research is the Central 
Archives for Finnish Business Records (Elinkeinoelämän Keskusarkisto), where 
the archives of many Finnish companies are currently located and easily acces- 
sible to scholars. Often, such archives require a licence to use, which typically 
does not form an obstacle to academic research. For example, a large number 
of private telecoms documents are available up until the 2010s. 

Despite the quality of the archive service, access to archival material and its 
quality are still key issues. For a private company, handing over the archive 
to the archival establishment is voluntary. The quality and usability vary on 
a case-by-case basis. At worst, even the material of important companies has 
been virtually lost. For example, when a large company, whose older archive 
sources are conveniently located in the National Archives, was asked about its 
late-1990s archives, it became clear that the company had outsourced the man- 
agement of these archives to a private archive management company, which 
in turn had transferred the material to its own repositories. Worst of all, there 
was not even a list available for that material. On the other hand, the private 
archive management company does not provide any 'extra services without 
an extra charge. Hence, to even find out whether the archives are relevant for 
scientific use would require a laborious and costly preliminary inquiry. On 
the other hand, some companies have already digitised their archives. How- 
ever, even if the material is in digital format, there is no guarantee that it will 
be accompanied by a proper search engine and metadata, or that the archive 
would be properly organised and/or that the researcher would have full access 
to the database. 

Business history has a long tradition of using digital images and optical char- 
acter recognition (OCR) techniques, similar to economic history. In this way, 
scholars themselves have digitised a considerable amount of material. These 
have already greatly accelerated the utilisation of broader amounts of informa- 
tion. These are mostly private, rather limited databases. When talking about 
the possibilities of these images and personal collections, it should be borne 
in mind that these are not usually complete sets. A business history scholar, 
rarely paid for their efforts in this regard, usually has to photograph only the 
‘necessary’ documents. For this reason, these private collections usually serve 
specific research questions. It is clear that large-scale digitisation ofthe material 
should be done by archives or large, well-funded projects in a professional and 
systematic way, leading to a publication of the data in a commonly used form. 
Unfortunately, the digitisation projects of aforementioned key Finnish institu- 
tions are still only in their infancy. Digitisation has, first and foremost, captured 
the oldest material. On the other hand, new machine reading technologies are 
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promising and will surely improve the usability of data in the future. Up until 
this point, very positive developments have taken place vis-à-vis search engines, 
making it easier for the researcher to find material from traditional archives. 

Discussion of the business history method has touched upon the usabil- 
ity of the history research method in social sciences (such as organisational 
research), and how business historians can contribute to these discussions. 
Qualitative history research that takes into account temporal processes, con- 
texts and coincidences has also been seen to be instrumental in building and 
modifying theoretical understanding.“ 

However, defining the method of historical research has become a prob- 
lem: instead of a clearly defined method and source series, qualitative history 
research often takes advantage of different perspectives and sets of sources 
that may change as the research process evolves. The problem arises because 
historians are not accustomed to describing these research processes with 
the precision that is customary in social sciences, which in turn has begun to 
take replicability seriously. This debate has highlighted the need for business 
historians to pay more attention to describing their methods.” This require- 
ment can also be viewed against the development of digital analytical methods. 
Since the idea of such methods is to automate the work, this requires event data 
coding in different ways, which in turn requires precision as well as continuous 
justification of choices. In this way, methodological precision and connection 
to theoretical models will be a more central part of the historian’s daily routine. 

Digitisation of business history sources and methods allows not only the use 
of qualitative data in larger quantities, but also the more intensive research col- 
laboration. A particularly interesting example of using digital methods in busi- 
ness history research pertains to the ‘Digital History of Telco and Exchanges in 
Finland and Sweder consortium.” The project includes researchers, social sci- 
entists and historians from Aalto University and several Swedish universities, 
including the Stockholm School of Economics. Moreover, one of the authors of 
this chapter has participated in this collaboration. At the heart of this 'DigHist' 
project is a database, which includes the digital business archives of four busi- 
ness enterprises. Two telecommunications companies and two exchanges from 
Sweden and Finland have been selected for the project. These archives have 
been digitised for their most relevant parts. The coded digitised material is 
shared between the members of the consortium. For example, the database 
contains key sections of the Finnish state-owned telecom company's (PTL Tele/ 
Sonera) archives (95 digitised archive boxes). Some of these have been digitised 
from the collections of the Finnish National Archives, but the others have been 
digitised from the material held by the current company Telia. Consequently, in 
one project, we were able to perform searches on all of the 764 Executive Team 
meetings (including attachments) that took place between 1981 and 1998. 
The software used also allows for the indexing of material and linking differ- 
ent documents to each other. Materials related to an interesting event can be 
assembled into a set of materials that make it easy to view relevant documents 
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together. The sheer amount of data and the search functions make it possible to 
efficiently compile information on the desired topics. 

At best, this type of working method enables quantitative exploitation of 
qualitative material and analysis. In addition, by working closely together, the 
project scholars have been able to develop a unique research design centred 
around collaboration across institutions and disciplines. Data availability, a 
common desktop and teamwork enable a highly effective and accurate research 
process combining different areas of expertise. Practical experience has shown 
that such a method also poses challenges. Finding information about a huge 
amount of data requires good knowledge of the case and the materials. To know 
what kind of potentially interesting things have happened in the company 
being researched (namely, the terms used in the company at different times), 
it is important that someone in the research group is knowledgeable about the 
subject and sources of the case. Again, easy and partially mechanical availabil- 
ity of the material may blind the scholar. Too narrow a focus on certain source 
series and ‘relevant’ documents may obscure the importance of the historical 
process and context, leaving the strengths of historical research untapped. In 
any case, such a way of working has proved to be a promising way of combining 
digital tools and theoretical knowledge with methods of historical research.“ 


Discussion and Further Challenges 


Digitisation is part of the development of technology and society, and hence 
something that naturally enhances economic and business history research. 
Its direct impact is related to the available material, the amount and usability 
of which are greatly improved as digitisation proceeds further. In many cases, 
such as in business archives, digitisation could potentially proceed much faster, 
but such efforts have been hampered by the lack of funding and expertise. 
Although digitising research materials and methods does not bring anything 
other than more efficient tools for managing the research process, at its best it 
can also be used as a tool for speeding up and facilitating the development of 
methodology and science, as well as international collaborations. 

For some, digitisation itself changes human and social sciences. At the heart 
of such ‘Google of archives’ thinking are massive increases in the amount of 
data and improvements in search functions. According to Berry and Fagerjord 
(2017), up until now, digitisation in human sciences can almost completely be 
understood as a mechanism for sorting availability and dissemination of mate- 
rial in large quantities. The discussion has dealt with technical issues that are 
considered to be part of the archive's or library's work. In fact, even the tools 
are not new in principle. As we have discussed here, databases and quantitative 
methods have been used for a long time, and even before PCs became avail- 
able. Economic history has been a forerunner in this and the lessons learned 
by economic and business historians—both successes and failures—could and 
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should, we argue, be used also more broadly among other historians. As Berry 
and Fagerjord conclude, the actual contribution of digitisation has to ‘move 
beyond the purely instrumental and mechanical automation of processing of 
humanities materials.” 

Excessive and straightforward trust in digitisation is methodologically prob- 
lematic. Using, for example, keywords, the desired documents can be found 
quickly from a large cache of data, yet a poor choice of keywords can lead the 
scholar to miss key contributions. Moreover, context and other areas are easily 
missed. The same applies to quantitative research, in which the researcher still 
needs to understand what dimensions and weights meant at different times 
and contexts. In that case, the researcher may unknowingly twist the history of 
an event in a way to reinforce his or her own hypotheses.” In reality, the need 
for someone to know the empirical material thoroughly does not disappear 
with the new digital collections and large-N methods. This is also an important 
starting point in all historical studies using, for example, regressions analysis: 
you need to know the units you are analysing before you analyse them, and 
what information you might still be missing from your analysis. 

Furthermore, the sheer amount of information is a methodological problem, 
because the researcher needs to separate the necessary pieces from a massive 
amount of data. The committee which explored options for developing Finn- 
ish state-owned businesses published a 154-page report in 1985. However, the 
same committee delivered into the National Archives material that takes up 
about two shelf metres. Most of these consisted of unorganised documents, 
which contained numerous versions of the same memoirs, meeting invitations 
and drafts." If this material were to be digitised and searched for, in practice, 
the same document would appear among the results tens of times, but only a 
few documents would increase our knowledge of the subject itself. We had 
a similar challenge with the MetaSignal project: using the large database con- 
taining news agency newsfeeds actually delivered the same information in the 
worst cases dozens of times. On the one hand, we could use this information 
to show the ‘hype’ around various topics, but on the other, it hindered the pos- 
sibilities of performing proper quantitative analysis. 

The creation and use of large digital collections require collaboration between 
several state and private actors. In the Finnish case, a specific role is played by 
the Finnish National Archives, which is responsible for the official documents 
created by the different state authorities. The Central Archives for Finnish Busi- 
ness Records (Elka), in turn, is the most important institution vis-a-vis pri- 
vate business archives and collections. The official collections (and to a certain 
extent also private archives) can be divided into roughly three groups, each 
entailing some specific challenges in today’s digital world. 

The first group consist of the ‘old’ paper archives, the total volume of which 
today is roughly 220 shelf-kilometres at the Finnish National Archives. Only a 
small fraction of this 'old' archival data has been or will ever be digitised; today, 
there are already 85 million digitised pictures available at the National Archives. 
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The goal is to digitise 20 per cent from this old material; mainly archives 
from the 1920s onwards. Nevertheless, the bulk of this material will also remain 
in paper format in the future. 

Second, a large amount of paper format official documents resides with dif- 
ferent state authorities that have been created since the 1970s, during the era 
of bureaucratisation, which are to be moved to National Archives in the com- 
ing years. The volume of these documents is around 135 shelf-kilometres. 
As it would not be sensible to build new warehouses to house such archives, 
the material will presumably be digitised en masse and the originals will be 
destroyed. On the whole, this will mean that future historians who are look- 
ing for official documentation from 1970 onwards have to contend with only 
digitised archives. To a certain degree, the same is occurring in the private sec- 
tor as well. 

The third challenge relates to the so-called born-digital documents. The 
new service acquire for born-digital material is to be launched in year 2021; 
moreover a pilot project for private archives was under construction in 2020. 
Whatever the archival solution will be, both regarding public and private docu- 
ments, the format will be digital. Thus, future historians will definitely need 
versatile skills to be able to use the digitised data and archives effectively. 

In general, historical sciences have been the pioneers in the utilisation of 
information technology since the 1970s. Since then, digitisation has inevitably 
progressed, but as we have noticed, often slowly and sporadically. However, the 
advancement of digitisation has embodied undeniable advantages. Economic 
history at large has been forging ahead of other fields of history in using big 
data, digitised sources and quantitative methods. By using the rhetoric embed- 
ded in the theoretical debates of the discipline itself, economic history might 
eventually lose its comparative advantage as other, larger fields of history are 
catching up in using these data and methods, which in itself would be a good 
turn for debates about big issues in history, such as trade, slavery, development, 
environment, conflicts and so on. 

There is a plethora of other challenges ahead for the field of history too, as 
well as economic and business history in particular. First, what materials should 
be digitised and when? This reflects the priorities among the scholars and the 
institutions that produce and maintain such records. Often, those priorities are 
not the same, which can create friction among the stakeholders. It also concerns 
resources and technologies available to facilitate such processes. Second, who 
has access and to what? While many archival collections are open access, some 
are not. And most published articles and books are not open access, which lim- 
its their use among scholars who are not institutionally linked and especially 
those who are located outside Western academic institutions. The same, of 
course, applies to the first concern about what materials are digitised; namely, 
for example, are business and economic records from the developing world less 
accessible those from the West? Third, new methods are emerging to analyse 
both the data itself as well as research trends, including bibliometric and AI 
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methods. These methods can offer great insights, but they can also be used to 
direct funds towards the most ‘popular’ types of research at the expense of, at 
least in terms of perception, more marginal topics. This can foster groupthink 
and could be detrimental to smaller, interdisciplinary fields like economic and 
business history. Finally, there are great challenges among the various fields 
of history to remaster quantitative techniques to be able to make use of the 
new ‘big data; given that the so-called cultural turn from the 1980s until the 
early 2000s had no real interest in quantitative analysis and that the ‘Cliometric 
Revolution’ often took economic historians to departments of economics. Now, 
there is a greater demand to bring back quantitative historians, who have the 
requisite skill set to work with these types of data and methods. However, to 
achieve that, humanities will have to compete with other fields, with higher 
wages and better resources, so this process will likely take some time. 


As stated at the outset of this chapter, our MetaSignal project failed 15 years 
ago. Would it be possible to create such an AI with historical data to predict 
future today? A number of similar software solutions have already been cre- 
ated, using various kinds of data sources. However, with similar sources and 
algorithms that we were using, it is highly unlikely that the project would suc- 
ceed even with today's computational power. Moreover, although digital meth- 
ods in humanities and social sciences have developed significantly over the past 
15 years, the use of these methods is still lacking behind the digitisation of 
sources. Nevertheless, it would certainly be beneficial to have historians on 
board to develop similar kinds of projects also in the future. AI methods are 
certainly already in use to deal with large datasets and analytical projects, 
and eventually they will become the cornerstones of historical analysis more 
broadly, although historians will have to exercise careful control over these 
efforts and remember the points of caution we have reflected on in this chapter. 


Notes 


' Ojala 2005: 19. 

? The early project outcomes are summarised in Ojala & Uskali 2005. 

? See esp. Andersson 2012: 1411-1430. 

1 See, e.g., Calafat & Monnet 2017; Ojala 2017: 446-456. 

> Tilly 1984. Tillys research focused heavily on finding structural patterns 
in history in the long term, as evidenced in his classic study of European 
urbanisation and warfare (Tilly 1990). 

$ Gutmann et al. 2018: 270, 280. 

7 McCloskey 1978. See also Conrad & Meyer (1958). The debate about 
slavery came to a head over the book by Fogel & Engerman 1974, which 
was criticised by many, including Gutman 2003 and Sutch 1975. See 
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also Kolchin 1992 on critique of the follow-up book. For a review of the 
Cliometric Revolution and its achievements, see, e.g., Goldin 1995; Greif 
1997; and Carlos 2010. 

* Whaples 1991; Eloranta, Ojala & Valtonen 2010. 

? See the Maddison Project database, version 2018, https://www.rug.nl/ggdc 
/historicaldevelopment/maddison/. On this database and its use, see Bolt & 
van Zanden 2014; Bolt et al. 2018. 

© Broadberry & Gupta 2006; Broadberry, Custodis & Gupta 2014; Broad- 
berry et al. 2015. 

" See, e.g., Pomeranz 2009; de Vries 2010. 

? See esp. van Zanden, et al. 2014. 

? Hjerppe 1989. 

^ Eloranta, Ojala & Valtonen 2010; Ojala et al. 2017. 

5 For a broader discussion of data and methods in business history, see, e.g., 
Decker et al. 2015. 

16 de Vries 2008; Humphries & Weisdorf 2015. 

17 See, e.g., Hoffman et al. 2002; Milanovic, Lindert & Williamson 2010; 
Piketty 2015. 

18 See esp. Allen 2001; van Zanden 2009; Ojala, Pehkonen & Eloranta 2016. 

? Ziliak & McCloskey 2016. 

? See esp. Sala-i-Martin 1997; Reckendrees 2017: 3. 

? Mahoney 2000; Ketokivi & Mantere 2010. 

? See, e.g., Atack 2018. 

2 See esp. Lamberg & Ojala 2006: 22-25; Lamberg, Laurila & Nokelainen 
2006: 307-312. 

^ For an introduction to this method, see Fiss 2011: 393-420. 

? See, e.g., Mahoney & Goertz 2006; Jordan et al. 2011. 

2 See esp. Pajunen 2008: 652-669; Järvinen et al. 2009: 545-574. 

? The final published article is Ojala, Pehkonen & Eloranta 2016. 

? On the models, see, e.g., Ojala & Tenold 2013: 17-35; Ojala et al. 2017: 
305-333. 

? [n the English case, we can go as far back as the 17th century; see Broad- 
berry et al. 2013 for further discussion. 

?' See jyu.fi/growth. 

>! About the project, see: Gobel 2010: 305-324; Veluwenkamp & Scheltjens 
2018. 

? van Tielhof 2002. 

? See, e.g., Åström 1962; Åström 1963; Åström 1988. 

* Bang & Korst 1930. 

3 Johansen 1983. 

% See, e.g., Eloranta, Moreira & Karvonen 2015; Moreira et al. 2015; Ojala & 
Räihä 2017; Ojala et al. 2018. 

?' See esp. Ojala, Pehkonen & Eloranta 2016. 

% Cf. Gutmann et al. 2018. 
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?? See, e.g., Amatori & Jones 2003; Scranton & Fridenson 2013. 

? The Freedom of Information Act (621/1999). 

^ Kipping & Üsdiken 2014; Üsdiken & Kipping 2014; Wadhwani & Bucheli 
2014; Decker 2017. 

? See, e.g., Yates 2014; Decker, Kipping & Wadhwani 2015; de Jong, Higgins 
& van Driel 2015; Stutz 2019. 

£ See https://blogs.aalto.fi/digihist/. 

^ One of the papers resulting from this project (Cheung, Aalto & Nevalainen 
2019) was selected as one of the best research method papers at the 
Academy of International Business conference in 2019. 

^ Berry & Fagerjord 2017: 14. 

“6 See, e.g., the discussion of the problems involved with a study conducted 
by Timothy Leunig and Hans-Joachim Voth on smallpox deaths, concern- 
ing both the sources they used and the methods involved. See Vervaeke & 
Devos 2018. 

? See the Committee Report 1985:2 (Valtion liikelaitoskomitean mietintö); 
committee archives in the Finnish National Archives. 
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CHAPTER 4 


Digital History 1.5 


A Middle Way between Normal and Paradigmatic 
Digital Historical Research 


Mats Fridlund 


History is one of the oldest and most conservative humanist disciplines, which 
begs the question how it could react to the current third generation or ‘wave’ 
of digital history and its new potential to transform the practice of historians’ 
research. History as a discipline is according to some digital historians at a 
crossroads, ‘in a transitory moment” and ‘standing on the edge of a conceptual 
precipice. The ‘understanding and practice of traditional history’ has been said 
to be ‘facing a fundamental “paradigm shift” and ‘straddling a line between 
revolution and continuity and that the resolution of ‘this tension is going to 
be a central part of historians tasks over the coming years? Some historians 
claim that ‘digital history has become the buzz-word for avant-garde histori- 
cal scholarship in the digital age? while others worry about external interests 
and pressures from funders, governments and industrial stakeholders and the 
possibilities of reallocation of resources and ‘fear for the hermeneutic character 
of the humanities, and a reduction of humanities research to data crunching 
or to a view that proclaims the search for underlying patterns and structures 
in human history and culture to be its essence: The overall concern is that his- 
tory will be transformed into a new primarily quantitatively focused discipline 


How to cite this book chapter: 

Fridlund, M. (2020). Digital history 1.5: A middle way between normal and 
paradigmatic digital historical research. In M. Fridlund, M. Oiva, & P. Paju (Eds.), 
Digital histories: Emergent approaches within the new digital history (pp. 69-87). 
Helsinki: Helsinki University Press. https://doi.org/10.33134/HUP-5-4 


70 Digital Histories 


where traditional ‘analogue history’ focused on narrative and close and deep 
reading of primary sources will be marginalised. 

This chapter want to take these hopes and fears of a paradigm shift in his- 
tory seriously and I will use my training as a historian and theorist of modern 
science and technology to analyse and conceptualise what such a paradigmatic 
change of historical science might mean. To do this, I will discuss what I have 
elsewhere identified to be the main methodological strands of computational 
digital history and in this use research from history and philosophy of sci- 
ence on revolutionary and paradigmatic change within science, and especially 
Thomas Kuhns historical and philosophical research on scientific revolutions.? 
In doing this, I have made the choice to, rather than provide an empirical case 
study of the practices of current digital historians, combine a description of 
some of the current practices within historical research with a larger concep- 
tualisation of what I and other digital historians have identified as some of the 
central methodological elements of the new digital history. 

The reason for this is that I consider it to be crucial for current and future 
digital historians to analyse and think reflectively about their new emergent 
historical practices. We need empirical descriptions of current historical prac- 
tice, but we need critical reflections and conceptualisations even more. As a 
conceptually minded historian, it is crucial for me to have conceptual tools 
that helps us better see and better understand. In this, I am inspired by Joseph 
Schumpeter’s statement on the foundation of historical analysis: 


Analytic effort starts when we have conceived our vision of the set of 
phenomena that caught our interest, no matter whether this set lies in 
virgin soil or in land that had been cultivated before. The first task is to 
verbalize the vision or to conceptualize it in such a way that its elements 
take their places, with names attached to them that facilitate recognition 
and manipulation, in a more or less orderly schema or picture.‘ 


Thus, the central task of this chapter is to attempt to conceptualise and attach 
names to some of the central elements of the new emerging digital history 
practices so that we can start our analytic efforts to better understand the 
new emerging digital history. 


Paradigmatic Change in Sciences, History of Science 
and Historical Sciences 


There are especially two main areas of Thomas Kuhns research on scientific 
revolutions that are of relevance to understanding the current changes within 
digital history. The first is Kuhn’s research on what he described as ‘the second 
Scientific Revolution of the 19th century and on the historical impact of quan- 
tification of earlier qualitative research fields. Quantification, Kuhn argued, was 
central for understanding the historical development of scientific research and, 


Digital History 1.5 71 


in 1961, in an article published just before Structure of scientific revolutions and 
at the same time as the historical sciences were entering their first quantita- 
tive 'Cliometric Revolution, Kuhn investigated ‘the effects of introducing quan- 
titative methods into sciences that had previously proceeded without major 
assistance from them.’ Kuhn starts his article describing how the Social Science 
Research Building at the University of Chicago on its facade 


bears Lord Kelvins famous dictum: ‘If you cannot measure, your knowl- 
edge is meager and unsatisfactory: Would that statement be there if 
it had been written, not by a physicist, but by a sociologist, political 
scientist, or economist? Or again, would terms like ‘meter reading’ and 
‘yardstick’ recur so frequently in contemporary discussions of episte- 
mology and scientific method were it not for the prestige of modern 
physical science and the fact that measurement so obviously bulks large 
in its research?* 


In his article Kuhn studies how the physical sciences achieved this exemplary 
and aspirational character for other sciences to follow, something which still is 
very much with us in the current debate on digital humanities and digital his- 
tory. The reason for physics' status as the contemporary model science, Kuhn 
posited, could be understood as coming from that 


physicists, as a group, have displayed since about 1840 a greater ability 
to concentrate their attention on a few key areas of research than 
have their colleagues in less completely quantified fields. In the same 
period, if I am right, physicists would prove to have been more suc- 
cessful than most other scientists in decreasing the length of contro- 
versies about scientific theories and in increasing the strength of the 
consensus that emerged from such controversies. In short, I believe 
that the nineteenth-century mathematization of physical science pro- 
duced vastly refined professional criteria for problem selection and that 
it simultaneously very much increased the effectiveness of professional 
verification procedures.? 


And the reason for this in its turn came from how the physical sciences "came 
to make use of quantitative techniques at all. Perhaps surprisingly to some, 
then and now, the physical sciences had not always been based on measure- 
ments and mathematics. Some parts of physics, what Kuhn described as the 
‘traditional sciences’ in the form of astronomy, optics and mechanics, had 
developed considerably quantitatively before the first scientific revolution, 
while the relatively new ‘Baconian sciences; ‘the study of heat, of electricity, of 
magnetism, and of chemistry, had not been a systematic field of inquiry pre- 
viously, but 'owed their status as sciences to the seventeenth century's charac- 
teristic insistence upon experimentation and upon the compilation of natural 
histories, including histories of the crafts." Their quantification and a wider 
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and more thorough mathematisation of physics overall took place during the 
first half of the 19th century and was accompanied by a number of new instru- 
ments, conceptualisations, theories and institutionalisations, which was part 
of what Kuhn described as a second scientific revolution of the sciences. The 
larger question in focus of this chapter is whether the historical sciences is cur- 
rently in such a Kuhnian moment. 

The second relevant area of Kuhn’s research is his more widely known gen- 
eral theory of scientific change that was first presented in Structure of scientific 
revolutions (1962) and that he continued to revise and refine for the remainder 
of his career. Kuhn' theory uses the history of scientific development especially 
during the first scientific revolution from the 15th to the 17th centuries to 
design a theory that outlines how a traditional or ‘normal science’ through a 
scientific revolution transforms into a new science, a radically different para- 
digm of knowledge practice. In this perspective, the response of a scientific 
community to crisis’ in the form of a major epistemological disruption usually 
follows either of two main paths, what can be described as the reintegration and 
domestication of the new disruption as part of the existing framework of tradi- 
tional ‘normal’ science, or the revolutionary transformation of the traditional 
science into a new science. 

Kuhn’s theory of scientific revolutions has been important in not just help- 
ing historians of science conceptualise changes within the natural sciences, 
but also in helping historians in general to better understand change within 
their respective domains. It is difficult to exactly translate Kuhn’s terminology 
to other areas and as I. Bernard Cohen points out, there are many problems 
with using Kuhn, such as that ‘historians and philosophers of science do not 
agree on what constitutes or defines a revolution in science; they do not have 
an objective test for the occurrence of such a revolution; and that ‘there are 
certain kinds of revolutions in science that do not exactly fit Kuhn’s schema." 
Nevertheless, despite these obstacles, several historians have used Kuhn’s con- 
ceptualisations to understand change also within historical disciplines. As 
David Hollinger has pointed out, 'Kuhn' terms have been employed explicitly 
by historians of art, religion, political organization, social thought, and Ameri- 
can foreign policy.” Those historians also include Thomas Kuhn himself, as is 
clear from his remark on an upcoming academic discussion of Martin Bernal’s 
‘Black Athena’ theory of ancient history, when he stated that it ‘was being held 
far too soon and that disciplines did not usually respond so quickly to funda- 
mental challenges." 

Aware of these problems, I use Kuhnian terminologies as ideal types (in a 
Weberian sense) to help me conceptualise the recent past, present and future 
developments within digital historical practice and to outline two major 
responses to the challenges of the new computational digital history, as well 
as sketch a possible methodological middle way navigating between the 
two. This is an extension of previous research of mine where I, as a part of 
an empirical digital history study, identified and outlined what I saw as the 
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major methodological strands within current digital history research. Fol- 
lowing Kuhn, I have described the two main ideal type responses towards the 
new disruptive digital methodologies as them either being domesticated and 
naturalised as part of traditional history, what Kuhn would describe as ‘nor- 
mal science, which I have termed digital history 1.0, or taking the second more 
revolutionary route in the form of a paradigmatic digital history 2.0, radically 
transforming and disciplining the practice of historical research. However, 
as an alternative to these two main routes of conservation or revolution, I also 
outline a potential third ‘middle way’ between the ‘normal’ practice of his- 
torical science and a potentially ‘paradigmatic’ digital history. The overarch- 
ing question is whether the new digital historians will want to transform, and 
succeed in transforming, the historical discipline overall, to break off and form 
a new historical discipline, or whether they prefer to remain part of history's 
‘disciplinary mosaic’.'° 


Our Invisible Digital History 


The digital has already changed historians’ practice so that today ‘all histori- 
ans are already digital’ whether or not they ‘self-identify as digital historians,” 
although perhaps in ways invisible to or at least not reflected upon by most his- 
torians. History is already changed through historians’ everyday use of digital 
tools and materials, something which can roughly be divided into the produc- 
tion, communication, presentation and administration of historical research.'* 
The following description might to some appear trivial, banal or mundane, 
but that should not diminish its importance; on the contrary, this ordinariness 
makes it even more important for understanding the wider impact of the digi- 
tal on the historians’ craft. 

The first and most important influence of digitisation is on historians’ pro- 
duction. Like other office workers, the overwhelming majority of histori- 
ans have since the 1980s been relying on digital computers as their foremost 
research tool. Most importantly, computers are used for writing and note- 
taking and since at least the 1990s also for organising and storing primary and 
secondary digital source materials, often in such portable digital document for- 
mats as photographed, scanned or born-digital images of archival documents, 
texts, photographs, artifacts, journal articles and books. The existence on most 
historians’ computers of hundreds or thousands of files with names ending in 
suffixes such as .doc, .pdf, .xls and jpg provides ample material evidence of 
the impact on historians’ practice from reading, watching, manipulating and 
writing of digital materials. 

Digitisation’s second major impact is on how we historians communicate 
with institutions and individuals that provide access to source materials for 
our research, such as archives and libraries, as well as with other historians and 
non-historical researchers within our research fields. Since the 1990s, emails, 
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mobile and smart phones, text messaging and social media has afforded his- 
torians ever faster and wider communication possibilities. Third, the digital 
has impacted the historians’ practice through making possible their research 
results to be communicated through new digital forms of representations. This 
is through presentations at academic conferences, seminars and talks, primarily 
through much easier and efficient use of digital images, figures and graphs, as 
wellas the increased use of digital presentation software programs such as Power- 
Point, Keynote and Prezi as well as online presentations and meeting using dig- 
ital applications such as Skype and Zoom. In addition, preliminary and finished 
research is routinely presented in the form of digital documents to colleagues, 
conferences and publishers, as conference and seminar papers, manuscripts, 
preprints and offprints of articles, chapters and books. The final way in which 
historians’ research practice has been impacted is with regard to its practical 
organisation and administration, through the various ways in which the digital 
tools and formats described above, together with the internet, have changed the 
possibilities for conducting research more effectively and (mostly if not always) 
with less costs in time and money. This includes all the ways in which we use 
the internet and especially search engines, such as Google, Bing and Baidu, 
to gather practical information about locations, access and opening hours for 
archives, libraries and museums, as well as conducting practical matters such 
as booking travel and buying books, source materials and artifacts through ser- 
vices such as Amazon, eBay and Alibaba, and registering and paying online for 
conferences or memberships in professional organisations. 

This normal everyday digital impact on the historians craft is most often 
invisible. The hidden digital tools and computational algorithms built into these 
various applications enabling our research are probably not much reflected 
upon by most historians, but these concealed tools have enhanced traditional 
history by making it faster, easier and cheaper in money as well as in time and 
energy. However, there are also other domesticated forms of digital methods 
and tools that in more conscious, reflective and visible ways have influenced 
historians’ practice, something which I describe as digital history 1.0. 


Domesticated Normal Science: Digital History 1.0 


By conceptualising various aspects of historians’ practice as digital history 1.0, I 
mean to accentuate that already today many historians, in addition to the invis- 
ible application of digital tools discussed above, have intentionally although 
often without much apparent thought appropriated digital methodologies as a 
part of their standard historical research practice. Digital history 1.0 includes 
how historians have integrated the use of digitally enhanced tools and materi- 
als as a part of their normal research practice, such as digital databases and 
resources such as Google, Wikipedia and JSTOR for digitally augmenting their 
historical research. Such historians might, however, not see themselves as 
doing ‘digital’ history, but just ‘history, as these digital applications have often 


Digital History 1.5 75 


been domesticated and seamlessly incorporated into ‘normal history”. This digi- 
tal ignorance or blindness is a returning complaint of digital historians, with 
statements such that ‘the average historian is at most a passive user of digitised 
sources in which he/she mostly sees a substitute for the material original’ and 
‘carrying out fairly traditional research as if the [digital] resource was not there 
(but hopefully citing it nevertheless)?" 

In the vocabulary of the historian and philosopher of science Thomas Kuhn 
these historians have augmented their ‘normal science’— history —of histori- 
cal research with the use of various forms of digital sources, tools and methods. 
By normal science, Kuhn means the established and dominant scientific tradi- 
tion of conducting research existing within a scientific discipline which ‘often 
suppresses fundamental novelties because they are necessarily subversive of its 
basic commitments?! The ignorant attitude among normal historians referred 
to above can be seen as exemplifying this. Another example is when one digital 
historian complains about traditional historians’ blindness to how the digital has 
changed the historians’ practice, how most historians today ‘combine 
traditional/analogue and new/digital practices, at least in the information gath- 
ering stage of their research. However, ‘reflection is often missing. On more 
than one occasion I have heard historians proclaim to be non-digital, as if this 
were something of which to be proud, while evidently making use of digital 
resources in their research?” Yet another digital historian describes ‘a degree of 
condescension and suspicion towards digital resources from many mainstream 
historians.? These examples could easily be multiplied. 

And still, digital history 1.0 has already visibly changed historians’ practice: 
first, by increasing the number of citations and the diversity of primary sources 
used, as well as a disproportionate use of citations to online sources. One 
example is from Canada, the first country to have two of its major newspapers 
the Toronto Star and the Globe and Mail digitised in 2002. Research on history 
doctoral dissertations uploaded to the ProQuest database between 1997 and 
2010 showed a 99196 increase in citations to the Toronto Star after it had been 
digitised, ‘as opposed to minor increases and even decreases for other newspa- 
pers.? Connected to this, digitisation has also changed how historians think of 
their archives. Traditionally, for most historians, an emblem of becoming a real 
historian and marking something of a rite of passage is to carry out research 
in a physical archive located in a particular (often remote) place where you sit 
and go through dusty and perhaps previously unread pages of primary sources 
in the form of paper documents such as letters, minutes, reports, etc. In the 
digital age, these traditional archives are often supplemented or surpassed by 
online document archives that you can access from your office chair at your 
home institution. But even when the historians do visit physical archives, their 
practice has been changed by the digital in that ‘analytical work is displaced 
from the archives. This is also due to new digital tools, as the 


use of digitized finding aids, digitized collections, and digital cameras 
[that] have altered the way that historians interact with primary sources. 
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While the centrality of archives to the research process remains, the 
nature of interactions with archival materials has changed dramatically 
over time; for many researchers, activities in the archives have become 
more photographic and less analytical.”° 


By changing the possibilities of access to distanced primary materials, the new 
digital resources have transformed history. 

One striking example of how the digital history practices can be transformative 
while almost methodologically invisible comes from the research by historians 
Sónke Neitzel and Harald Welzer on the politics and world view of German 
Second World War soldiers that was based on a previously unused source 
material in British and American archives in the form of several hundred thou- 
sand pages of transcripts of interrogations with German POWs. This ground- 
breaking in-depth research on this ‘mind-boggling amount of material’ was 
only made possible through the use of digital methodologies and was described 
in the following way in their monograph Soldaten: “We were able to digitize all 
of the British documents and most of the American material and sort through 
it with the help of content-recognition software?” This is all that is said. No fur- 
ther words on their digital research methodology such as what software, search 
methods or keywords that were used. The choices made and opportunities cre- 
ated by the digital tools have been made almost totally invisible. 

It appears that Toni Weller is correct in stating that 'for most historians, the 
challenges of the digital age are not ones that are seen to directly concern their 
research’ and that the suggestion by an author commenting on the tenure, pro- 
motion and review process ‘that "learning to use a database, scan materials, and 
query that database all consume time that could be used to write" is probably 
a reasonably accurate reflection of the way the majority of historians perceive 
digital scholarship.” However, there are those historians where the digital is a 
primary methodological focus in their research practice and who are practising 
a more radical form of digital history 2.0* 


Revolutionary Paradigmatic Science: Digital History 2.0 


Some digital historians appear to see digitisation’s ‘profound transformation of 
history as inevitable, in that they state that as ‘datasets expand into the realm 
of the big, computational analysis ceases to be "nice to have" and becomes a 
simple requirement.” This new paradigmatic digital history practice ‘offers 
a stark contrast to what has become standard historical practice? The current 
revolutionary enthusiasm is in some ways reminiscent of digital history’s first 
wave in the 1970s when ‘it looked like history might move wholesale into quan- 
titative histories, with the widespread application of math and statistics to the 
understanding of the past’ and resonate with the past ‘hyperbole that saw com- 
putational history as making more substantial “truth” claims, or the invocation 
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of a “scientific method” of history? The question is whether also the current 
putative computational revolution will live up to the high hopes and hypes or 
if it also will wane to become just another small specialised sub-discipline of 
the historical discipline or that it perhaps will abandon history and emigrate, 
like many of the first generation of digital historians who left the humanities 
for the social sciences and its new, more quantitatively inclined sub-disciplines, 
such as social and economic history. 

The question is whether this new potentially revolutionary historical para- 
digm can be described, in Thomas Kuhn’s words, as the outcome of a scientific 
revolution ‘from which a new tradition of normal science can emerge? Kuhn 
described ‘what all scientific revolutions are about’ in that they 


produced a consequent shift in the problems available for scientific 
scrutiny and in the standards by which the profession determined what 
should count as an admissible problem or as a legitimate problem- 
solution. And each transformed the scientific imagination in ways that 
we shall ultimately need to describe as a transformation of the world 
within which scientific work was done.? 


After a paradigm shift, it is not just what is valued as good research that has 
shifted, but the disciplines core elements are transformed and the field is recon- 
structed 'from new fundamentals, a reconstruction that changes some of the 
fields most elementary theoretical generalizations as well as many of its para- 
digm methods and applications.” What is accomplished in this is the trans- 
formation of the ‘disciplinary matrix' —what is considered as the relevant and 
central methods, significant data, instruments, theory, methods, concepts and 
working practices. Below, some of the major elements of the possible discipli- 
nary matrix of digital history 2.0 will be outlined. 

Digital history 2.0 is taken to represent research practices with a potential to 
form a new digital historical paradigm primarily focused on new quantitative 
and computational methods to undertake text analysis and manipulations and 
visualisations of historical data. Its research systematically use various digital 
applications and quantitative methodologies for big-data text and data mining, 
calculations and visualisations, such as topic modelling, network analysis and 
text and data scraping. Most of these methods necessitate investments in acquir- 
ing expertise in or collaborators skilled in coding and database methodologies. 

Like with paradigm change within the sciences, the new digital history prac- 
tice transforms the existing practice by introducing new focus and altering 
what is valued, making some of the existing ideals and standards less relevant 
or obsolete in favour of new values and concepts salient to the particular char- 
acteristics of the new history. One such new key aspect of the digital history 
2.0 can be described as compression, which characterises methods that allow 
the historian ‘to begin with the complex and winnow it down until a narrative 
emerges from the cacophony of evidence.” This is in contrast to ‘normal history’ 
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where historians, like good detectives, test their merit through expansion: the 
ability to extract complex knowledge from the smallest crumbs of evidence, that 
history has left behind. By tracing the trail of these breadcrumbs, a historian 
might weave together a narrative of the past.” Some historians even question 
whether the digital turn will so much change history's foundational concepts 
to ‘render the word “narrative” too confining for describing what historians 
produce’ and to make historiographies into a ‘more encompassing term.” 

Normal historians prefer to describe the empirical foundations of their con- 
clusions in terms of documents, sources and at times even ‘facts, while the new 
digital historians often prefer to talk about data. Jim Mussell describes perhaps 
the core aspect of the new digital history just in that it ‘requires a change in 
focus from document to data. Data as information, in forms that are able to 
be processed by computers, is central to the new digital history, qualitatively 
as well as quantitatively. Its qualitative effect is the view favouring ‘data to sig- 
nify what counts as the preferred and proper basis for constructing a histori- 
cal argument. The quantitative impact lies in that the new digital texts provide 
copious and often very easily accessible source materials for historians. In 2008, 
a senior digital historian stated with special reference to the recently started 
digitisation efforts by Google Books, online digital image collections and the 
creation of digital newspaper archives that it was ‘now quite clear that histo- 
rians will have to grapple with abundance, not scarcity’ and that ‘nearly every 
day we are confronted with a new digital historical resource of almost unim- 
aginable size. In that sense, history could be seen as having entered the era 
of big data or perhaps better ‘biggish data. How much data it takes to makes 
it ‘big’ has been described as ‘in the eye of the beholder’ in that if ‘there are 
more data than you could conceivably read yourself in a reasonable amount of 
time, or that require computational intervention to make sense of them, it’s big 
enough!“ One example of such big data for historians are the online Old Bailey 
records (www.oldbaileyonline.org), which consist of almost 200,000 criminal 
trials between 1674 and 1913 and 127 million words.? 

The rise of online archival research and the loss of the manual physical han- 
dling of original primary sources is one example of how the material practice of 
the historian is changing in the digital era. Another example of a radically new 
social dimension consists of multidisciplinary teamwork. This might be one of 
the most challenging aspects of the new history to many traditional historians. 
Although many examples exist of co-authored works in history, it is still far 
from the norm, and when it does occur it is rarely with collaborators from 
outside historical disciplines. Another changing practice is a shift to totally 
new activities in that ‘less than 5% of the time spent on a project will be time 
spent analyzing and visualizing data, with the majority ‘spent on collecting, 
cleaning, and interpreting.? Another aspect of the changing historical practice 
is new digital forms of publications as the traditional paper forms of historical 
publications are not seen as 'suited to the fast-changing discourses ofthe digital 
age— demonstrated by the fact that most pure digital history texts tend to be 
in the form of websites, blogs and online articles and journals rather than the 
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traditional historical outlet of the monograph.“ Such new digital forms of pub- 
lications also make possible new dynamic and interactive forms of presenta- 
tions with inclusion of digital sound and video files, as well as scalable images, 
maps and network graphs. 

To conclude this discussion of the changing practice of the new digital history 
practice, I will quote the two computer scientists behind the Culturomics pro- 
ject who also helped to develop the Google Ngram Viewer and when criticised 
for not having included any historians in their project explained it thus: 


Even when we found historians who shared our enthusiasm, there were 
still great barriers to working together. For instance, [a meeting was 
convened with] about a dozen interested history students and faculty. 
The historians who came to the meeting were intelligent, kind, and 
encouraging. But they didn't seem to have a good sense of how to wield 
quantitative data to answer questions, didnt have relevant computa- 
tional skills, and didn't seem to have the time to dedicate to a big multi- 
author collaboration. It's not their fault: these things don't appear to be 
taught or encouraged in history departments right now.” 


In short, history had failed in being willing to work like computer science. 


Semi-Automatic History: Digital History 1.5 


Some digital historians propose a less radical transformation than that prom- 
ised by digital history 2.0, where ‘historians do not need to learn new technolo- 
gies or computer codes; they do not need to become computer scientists: They 
disagree with those advocating a revolutionary transformation of the histori- 
cal practice and argue that a part of ‘the problem thus far has been too much 
emphasis on historians becoming something they are not; to the detriment 
of the fundamental skills and expertise that is the craft of the historian.“ The 
real challenge lies, such historians argue, ‘in persuading the vast majority of 
historians of the benefit of even relatively simple information technology, not 
in developing specialist historical tools and methods that would only ever be 
of relevance to a minority of historians." Some like Gerben Zaagsma want to 
go somewhat further and consider that the 'real challenge is to be consciously 
hybrid and to integrate "traditional" and “digital” approaches in a new practice 
of doing history.“ ‘Digital history 1.5’ aligns itself with such views and can be 
described as an acknowledged and reflective digital history ‘without the pro- 
gramming’ that consist of the use of semi-automatic historical methodologies 
in between normalised ‘digital history 1.0' and paradigmatic digital history 2.0' 
research methods.” 

Digital history 1.5 is a hybrid or mixed methodology in that it is a combi- 
nation of quantitative and qualitative historical research methodologies, and 
semi-automatic as it combines a large amount of manual evaluation with the 
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systematic use of automatic analysis vested in pre-programmed offline and 
online calculation and visualisation applications and tools using digital text 
and databases, such as Google Books, Early English Books Online (EEBO) and 
digitised historical newspaper archives. That this digital history is without pro- 
gramming is of course not absolutely true in that it does use digital applications 
based on a lot of computer code and many mathematical algorithms, but this 
coding and programming is invisible as it is pre-packaged in the various appli- 
cations and tools: it is ‘black-boxed’ to the historian user.” 

What differentiates digital history 1.5 from digital history 1.0 is that it consists 
of a systematic use of digital tools and sources where the digital methodol- 
ogy is the central method enabling the investigation. Furthermore, it incorpo- 
rates a conscious reflexivity about the digital sources, resources and methods 
used in the investigation and is being reflective about its respective strengths 
and weaknesses. At the same time, it is not ‘digital history 2.0' in that in its 
investigation it is using pre-programmed applications and resources without 
any additional coding of software, advanced programming of applications or 
tuning of digital techniques and methodologies. Some specific digital history 
1.5 methodologies are semi-automatic text extraction and presentation, which 
combine quantitative computer-enabled ‘distant reading’ of big data digital text 
corpora and qualitative ‘close reading’ of extracted individual texts.” This takes 
the use of semi-automatically extracted and processed databases where the 
individual texts can be newspaper and journal articles that could be collected 
using various online search interfaces such as those that exist at various online 
newspaper and journal archives. 

To conclude this treatment of the hybrid practice of digital history 1.5, two 
of its central methodological elements will be conceptualised. This is inspired 
by Ted Underwood’s article “Theorizing research practices we forgot to theorize 
twenty years ago, which argues the need for digital humanists to ‘think more 
rigorously and deliberately about existing practices.” The first central element 
is its key technology, as well as a central engine of the potential digital history 
revolution, in the form of the search engine. One problem with talking about 
‘search’ for digital historians is that it is, as Underwood states, ‘a deceptively 
modest name for a complex technology that has come to play an evidentiary 
role in scholarship.? By ‘search is meant the algorithmic mining of large elec- 
tronic databases that since the 1990s has been used by humanists. Furthermore, 
the term ‘search only points to its use as a finding tool and leaves out its wider 
methodological implications and—echoing digital historians’ criticism of tra- 
ditional historians’ negligence of their digital tools as discussed above—that the 
‘scholarly consequences of search practices are difficult to assess, since schol- 
ars tend to suppress description of their own discovery process in published 
work.™ Therefore, as a way of contributing to digital history's conceptual devel- 
opment and to make the existing digital history methodologies more explicit 
and reflective, I have elsewhere described and named an already existing quali- 
quantitative digital history methodology. I thus proposed the term readsearch 
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for the methodology of using online keyword searches as being ‘a new hybrid 
concept denoting a quali-quantitative methodology combining targeted close 
manual and machine distant reading through the use of search engines on large 
digital text corpora.” 

Furthermore, I have attempted to further explicate the various forms of read- 
search methodologies and problematise the use of search for research. Taking 
inspiration from Underwood, who explains that ‘a full-text search is often a 
Boolean fishing expedition for a set of documents that may or may not exist,°° 
and in line with this I differentiate between different readsearch methodolo- 
gies by categorising them into three main forms: spearfishing, angle and trawl 
readsearch. ‘Spearfishing readsearch’ designates a form of search consisting of 
browsing through a large text corpora close to what can be described as ‘online 
microfilm browsing, in that the search interface is using various keywords or 
dates to focus the search, but at the same time allow the reader to immerse 
him- or herself in the text until he or she comes across any relevant findings. 
When using ‘angle readsearch; the researcher searches for texts referring to one 
specific unique event, person or place and thus like an angler adapts the angles 
(the search terms) to tailor them for best catching a particular fish (an event or 
entity). Finally, in the use of ‘trawl readsearch; the search is used to find many 
hits of a general term, word or phenomena and this is the form of readsearch 
where the distant machine reading plays the largest part. Like when fishing 
using a trawl, this is a combination of machine and manual reading. After a 
large fishing trawler makes a catch in its trawl, it hoists it up and empties the 
catch onto the vessel and then manually goes through the catch to sort out and 
‘throw back the unwanted catch: fish of the wrong species or too small to mat- 
ter, as well as garbage caught up in the trawl. Similarly, the texts found through 
a search's machine reading is in a trawl readsearch examined manually to sort 
out the valuable and searched-for texts. This is a methodology especially used 
when tracing the change of a concept or a term over time. Some readers might 
find these methodological neologisms as too idiosyncratic to be meaningful 
and whether digital historians in the future will follow in adapting the specific 
readsearch terminologies is of less importance. What is crucial for them to fol- 
low, however, is in reflecting on their digital epistemology, what their use of 
digital methods does to the historical knowledge being produced and to explic- 
itly conceptualise and theorise their practice as historians using digital tools 
and resources. 

The second main element of digital history 1.5 connects to historian Andreas 
Fickers claims that as a response to the salience of the new digital sources, the 
discipline of history needs ‘a new digital historicism. This historicism should be 
'characterized by collaboration between archivists, computer scientists, histo- 
rians and the public, with the aim of developing tools for a new digital source 
criticism. Along with many digital historians, I would add to this the need 
for a digital resource criticism that extends historians’ critical faculties to the 
digital resources they use, such as the search engines, algorithms, programs 
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and applications. Overall, a digital historian ‘requires a more advanced under- 
standing of the affordances of the digital in order to perform more advanced 
research!” Historians, like most users of digital technologies, use technology 
‘without reflection, without understanding how it actually works’ and thus 
need to develop a new digital reflexivity. Like historians are trained to consider 
and look for the contextual and authorial biases of our historical sources we need 
to think about ‘the worldviews built into our tools; as too often we tend to 
forget 'that our digital helpers are full of "theory" and "judgement" already. As 
with any methodology, they rely on sets of assumptions, models, and strategies. 
Theory is already at work on the most basic level when it comes to defining 
units of analysis, algorithms, and visualization procedures.“ In doing this, the 
traditional skills of historians 


are still necessary, but the focus on practice—on doing things with 
data— extends their application, forcing a recognition of the constructed 
nature of evidence and its relation to the absent past. Necessarily spec- 
ulative, the historian must bring his or her expertise to bear on these 
digital environments and evaluate the plausibility of what they both 
embody and imply.” 


When we historians start ‘to think digitally, we can gain a better understand- 
ing of the underlying mechanisms, algorithms, programmed omissions and 
choices of our digital tools and allow the historian ‘to be a better critic, a better 


consumer of digital data, a better user, and thus a better historian. 


Conclusions: Business as Usual or Going Fully Digital? 


This chapter has in many ways gone against historians’ normal practice. Instead 
of trying to see the patterns and causes of past events after the dust has settled 
it has tried to discern the contours of emerging phenomena and to conjecture 
about possible future outcomes. This it has done to try to better understand 
which way or ways history will take in our ever increasing digital age. Will it be 
the old-trodden one or a new and radically different path? This has been a nec- 
essarily speculative exposition of three routes for digital historians that could 
be summarised as unreflective normalisation, paradigmatic transformation 
and reflective appropriation. In this, it has tried to point to the third middle 
way as a wider route for historians who are neither satisfied with just continu- 
ing with their historical “business as usual’ by staying agnostic about its already 
existing digital methodological dimensions nor prepared to join the specialised 
minority of historians who will go fully digital by learning to code or enter 
into collaborations with computer and information scientists. In this, I align 
myself with previous digital historians, such as Toni Weller, who have argued 
that ‘part of the “them and us” problem thus far has been too much emphasis 
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on historians becoming something they are not; to the detriment of the funda- 
mental skills and expertise that is the craft of the historian. 

To conclude, let us return to Thomas Kuhn and take some solace from his 
statements ‘that there can be small revolutions as well as large ones, that some 
revolutions affect only the members of a professional subspecialty’ and on 
rare occasions ‘two paradigms can coexist peacefully.® Furthermore, history 
teaches us that revolutions, scientific as well as political, always come at a cost 
and bring losses as well as benefits, such as in 


the transition from an earlier to a later theory, there is very often a loss 
as well as a gain of explanatory power. Newton's theory of planetary and 
projectile motion was fought vehemently for more than a generation 
because, unlike its main competitors, it demanded the [conceptual] 
introduction of an inexplicable force that acted directly upon bodies 
at a distance. Cartesian theory, for example, had attempted to explain 
gravity [mechanically] in terms of the direct collisions between elemen- 
tary particles. To accept Newton meant to abandon the possibility of any 
such explanation.“ 


However, although the new ways of understanding the world were triumphant, 
‘the price of victory was the abandonment of an old and partly achieved goal. 
For eighteenth-century Newtonians it gradually became “unscientific” to ask 
for the cause of gravity.” The task ahead for us historians is to make sure that, 
whoever will succeed in shaping the apparently inevitable further digitisation 
of the historical discipline, into a domesticated or revolutionary historical prac- 
tice or something in between, that history’s rewards outweigh its losses. 


Notes 


! Graham et al. 2015: 35. 

? Weller 2013b: 1; Graham et al. 2015: 35. William Cronon in 2012 as Presi- 
dent of the American Historical Association said that he 'increasingly 
believe[s] that the digital revolution is yielding transformations so pro- 
found that their nearest parallel is to Gutenberg’s invention of moveable 
type (see Cronon 2012). 

3 Weller 2013a: 195. 

^ Zaagsma 2013: 24; Weller 2013a: 195. 

5 Fridlund 2017; Fridlund & La Mela 2019. 

$ Schumpeter 1954: 42. 

? Kuhn 1961: 162. 

8 Ibid.: 161. 

? Ibid.: 190. 

10 Ibid.: 185. 
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! Ibid.: 186, emphasis in the original. 

2 Cohen 1987: 24, 31. 

? Hollinger 1989: 108. 

^ Bernal 1991: xix. 

5 My distinction between digital history 1.0 and 2.0 is similar to but more 
general than that of Jim Mussell, who primarily discusses changing digital 
history practice in relation to the digitisation of source materials. See 
Mussell 2013: 80-91. 

16 Graham et al. 2015: 4. 

7 Ibid.: xvii. 

35 This description focuses on the historian as a researcher and does not 
include changes to the historian's practice as a teacher, administrator or 
public historian. 

? Besides using ‘invisible’ domesticated digital tools such as word process- 
ing, email, search engines and electronic articles, pictures and documents 
in their normal professional research practice. 

2 Zaagsma 2013: 18; Mussell 2013: 90. 

?' Kuhn 1970: 5. 

? Zaagsma 2013: 17. 

? Weller 2013b: 4. 

^ Bilansky 2017: 517. 

3 Graham et al. 2015: 48. 

% Rutner & Schonfeld 2014: 8. 

” Neitzel & Welzer 2012: ix-x. 

°8 Weller 2013b: 3. 

? Graham et al. 2015: 4. 

* Ibid.: 1. 

?! Ibid.: 23. 

? Kuhn 1970: 84. 

? Ibid.: 6-7. 

** Ibid.: 85. 

Graham et al. 2015: 2. 

36 [bid.: 1, emphasis added. 

7 Ibid.: 32. 

8 Mussell 2013: 81. 

? Daniel J. Cohen in Cohen et al. 2008: 455. Cohen was echoing and answer- 
ing the question posed in 2003 by his digital history predecessor Roy 
Rosenzweig in an article entitled ‘Scarcity or abundance?* 

^ Graham et al. 2015: 264. 

41 [bid.: 3. 

? Hitchcock et al. 2012. 

? Graham et al. 2015: 235. 

^ Weller 2013b: 4. 

^ Aiden & Michel 2011. 
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4 Weller 2013b: 1. 

4” Anderson 2008. 

^ Zaagsma 2013: 17. 

^ My designation of digital history 1.5 and 2.0 is close to what Zaagsma 
describes as ‘plain IT’ and enhanced IT’ respectively (see Zaagsma 2013: 
12). 

5 Fridlund 2017; Fridlund & La Mela 2019: 12. 

>! Moretti 2000; Moretti 2005; Moretti 2013. 

2 Underwood 2014: 64. 

> Ibid. 

> Ibid.: 65. 

> Fridlund & La Mela 2019: 13. This is similar to ‘critical search’ as described 
by Jo Guldi (see Guldi 2018). 

°° Underwood 2014: 64. 

” Fickers 2012: 26. 

** Mussell 2013: 91. 

° Graham et al. 2015: 54. 

© Rieder & Röhle 2012: 70. 

61 Mussell 2013: 91. 

9? Graham et al. 2015: 267. 

$5 Weller 2013a: 195. 

6 Kuhn 1970: 49. 

$5 Ibid.: xi. 

6 Kuhn 1961: 184. 

$7 [bid. 
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CHAPTER 5 


Building Historical Knowledge 
Byte by Byte 


Infrastructures and Data Management in 
Modern Scholarship 


Jessica Parland-von Essen 


Introduction 


Historians are very good at source criticism, but in the digital era this requires 
good provenance data. Historians should also step up to the demand for 
transparency and open scholarship that comes with digital humanities. 
Research and knowledge has to be well documented and reliable. This means 
we need good data management, but also better and more integrated services 
and infrastructures. 

Despite often exceptionally rich descriptive metadata in the cultural herit- 
age sector, research life cycle data management is not easy and finding sources 
might be difficult due to questions of metadata formats or granularity of publi- 
cation. The humanists’ workflow and practices regarding use of sources is often 
hybrid and only partly digital.’ In this chapter, I will analyse different digital 
data types and infrastructures from the point of view of a historian and discuss 
the needs of historical research and knowledge creation. Questions about data 
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management and information structures are important to solve, so that it is 
possible to formulate service needs and user stories for historical research data 
services. I will propose a model for planning research data management and 
data publication for historians. The chapter focuses on the Finnish research 
sector, but includes relevant international infrastructures and initiatives. 


The Concept of FAIR Data 


FAIR data was minted as a concept in an expert meeting among science data 
experts, and resulted in a seminal article on research data management pub- 
lished in 2016.? The concept, which was a more than needed completion to the 
Open Science, Open Access and Open Data rhetoric, won immediate appro- 
bation within the European Union and other data-aware stakeholders. It was 
obvious that open data or access was not by far enough to solve the issues with 
science reproducibility, let alone the efficiency goals of the Digital Single Mar- 
ket. Data cannot always be open and there were other, more technical hurdles, 
too. Data needed to better managed, and the money invested in research should 
not be wasted by sloppy planning. To make the most of our data, it has to be 
organised and taken well care of. Only then can we combine datasets and build 
digital knowledge by linking publications and data in sustainable and trust- 
worthy ways. 

In short, the FAIR principles state that data should be Findable, Accessible, 
Interoperable and Re-usable. It turns out that these fine words in practice result 
in very technical definitions. When going into details, we soon exceed the level 
most scholars in the humanities should have to be bothered with. We should 
simply have workflows and services that support these principles, but for that 
to happen, all stakeholders have to raise their awareness and understand what 
is necessary to accomplish regarding services and infrastructures. 

Let’s take a short look at the principles and how they could be translated 
into a relevant form for our purposes. F stands for Findable. What this actually 
means is machine-readable. The amount of data today is so immense that it is 
important that computers cannot only sort out data, but also act upon it and 
find what is really relevant. This means, for instance, not only that digitising 
text so that it is only in image form is not sufficient, but also that the content 
of text needs to be organised in more specific, semantic ways: it requires struc- 
tured metadata and keywords, as well as common and persistent identifiers 
for concepts like persons or place names. Furthermore, the metadata has to be 
available for and utilised by different kinds of indexing and search tools. 

A means Accessible. This, in practice, today means data that can be down- 
loaded over the web, or at least the internet. Both machines and humans should 
be able to understand the information the data represents or contains, and it 
should not be transferred or changed in non-transparent or undocumented. 
ways. I, as in Interoperable, is a tough one. It means you should be able to com- 
bine datasets and copy metadata smoothly, without losing any information. 
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This means you should comply with existing standards and formats. As research 
data management in many ways is in its infancy and the information systems 
are still largely insufficient or impractical, this is difficult. It is necessary to bal- 
ance the needs of the research and serve the actual research use, which must 
be prioritised. Unfortunately, many researchers are inclined to think that their 
data is far more different and unique than it actually is or needs to be. Usually, 
it is possible to find some aspect of the data that somehow relates to something 
else, be it source, structure or some semantics of the content. As people tend 
to understand how much effort they have put into their own work and devel- 
opment, it is too easy to underestimate the value of other people’s work. The 
not invented here syndrome’ can easily trump real creative openings and slow 
down research. Particularly in the life sciences, there have been many impor- 
tant insights and tools developed (bioinformatics might be the oldest domain- 
specific field within research data management). We should copy as much and 
as fast as we can from other, successful domains. 

R, which is Re-usable data, means that it has a functioning licence or rights 
statement, but also that it has been thoroughly documented so that another 
researcher, or the composer of the dataset in 10 years for that matter, can take 
a dataset and use it again. Often, researchers spend up to 80% of their time 
creating or cleaning their data.* Therefore, careless documentation can be con- 
sidered an inexcusable waste of resources and time. 

The utmost goal, besides efficiency, is of course trustworthy, high qual- 
ity research. The digital environment has the unfortunate quality of being 
simultaneously dynamic and unreliable. Links, even in scientific publications, 
tend to break? This phenomenon is called link rot. Similarly, the content 
behind the link might change in a devious, unnoticeable way, which is called 
content drift. To address this problem, one of the main building blocks of FAIR 
data are persistent identifiers. Above, I mentioned identifiers for different kinds 
of concepts, which makes it easy to trace and link information. Researchers 
might have their own identifiers in the form of an ORCID, which is personal, 
unique and resists changes in name form or affiliation, and makes it possible 
to differentiate people with the same or similar names. Correspondingly, the 
datasets and articles should have their own identifiers, a URN or a DOL, which 
makes citing clear and unambiguous. The point is then the persistence; namely, 
the sustainability of this identifier. This means that we need platforms and ser- 
vices that provide and manage them on a long-term basis. This has a direct 
connection to the importance of infrastructure, which I will address later in 
this text. 

To a historian, it is obvious that one has to address problems of sustainability 
in the long-term perspective, as well as that the sources need to be well docu- 
mented. Are there other means for evaluating the trustworthiness or suitability 
of the data for our needs? Or to ensure that the data are authentic and have 
maintained their integrity? We need to know who said what, where and when. 
Simultaneously, we also need to accept that our own research outputs should meet 
these requirements. 
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The example of citations, the ultimate goals and tests for the data, demon- 
strates well the problem of sustainability. We should ask ourselves how can I 
cite (link to) my (digital) source in a persistent and unambiguous way and how 
can someone else cite the data I have created? There are recommendations for 
this, but they are not obviously sufficient or easy to implement. The national 
Finnish guideline for citing research data offers principles for citing a dataset, 
but how to cite more dynamic resources and what to do* when the resource 
does not provide identifiers or possibilities to download or save (partial) snap- 
shots? Or even if the researchers manage to download the needed data, where 
do they archive it conveniently? The questions of data management during 
research are inescapable for all these practical, technical reasons. However, data 
management is even more complex for historians, because of questions about 
personal data regulation, ethical issues and copyright. 


The Historian's Data Life Cycle 


In Finland, the government and major research funders have promptly adopted 
the Open Science ideology, and research data was included in the policies at 
an early state.” There has been quite extensive work done on a national level 
regarding services, formats and recommendations. In parallel, there has been 
an effort for interoperability and digital preservation within the cultural herit- 
age sector. This has produced services like the search portal Finna.fi and the 
national preservation services.’ These and their future development are of 
course both important from a historian's point of view. Still, the situation for 
research data is quite different, since research data does not come with a clear 
legislation, accountability and centuries-old tradition of long-term or even 
short-term management. Responsibilities are often unclear when it comes to 
both rights and costs. In the humanities, researchers are used to expecting free 
or subsidised services when it comes to sources and information management. 
On the other hand, the research outputs are clearly considered to be the prop- 
erty of the researcher, at least concerning copyright. The work within humani- 
ties is considered creative and personal and thus often falls under intellectual 
property rights legislation. 

The problem is, of course, that ownership is not a simple concept when it 
comes to digital resources. There are many kinds of rights and responsibilities 
entailed in ‘owning’: who has the right to access, copy, use, give access, agree 
on use, alter or destroy a dataset? Who has the responsibility to keep the plat- 
forms running, create metadata, plan for migrations, manage access for the 
next decades and curate the metadata or data if errors are found? It sometimes 
seems that some believe that the researcher herself should have all the rights 
with no responsibilities, even after the research has ended. This obviously does 
not work. There has to be an agreement and a balance in responsibilities and 
rights management. The researcher might have to give up some of the control 
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Figure 5.1: The DataONE data life cycle. Source: Author. 


of her data in return for someone taking care of it. This calls for trust from both 
parties and concordance on common interest and explicit agreements. This is 
usually not a problem, but problems tend to arise from insufficient research 
data management planning. The agreeing would best be done in advance, pref- 
erably not by dictates from one party or the other, but by joint interests which 
should be easy to identify. Since the historian rarely makes up the data, but 
refines existing digital or non-digital data, there are usually concerns that need 
to be taken into account already when the data is created. Therefore, the data 
management life cycle always starts with planning. 

There are different interpretations of the research data life cycle, but gen- 
erally they tend to be variations of models that reflect the traditional way of 
understanding how the research process works in theory (see Figure 5.1). The 
idea is that there is always a project and one or several funders. Although often 
presented as a circular, never-ending process, one premise seems to be linear 
progression of the research process, as well as of science and knowledge build- 
ing. This is, as any historian or other researcher knows, obviously a construct 
that nicely resonates with the way in which scientific publication traditionally 
works, with outputs that are corresponding, constructed narrations about the 
research process. The reality is much more chaotic and unorganised, which any 
data librarian will also willingly admit. The traditional publishing comprises 
snapshots, reports frozen in time, documenting what has been done, for dis- 
semination and future reference. Still, these knowledge bytes are cumbersome, 
ambiguous and digitally discrete from the sources. 

Thus, the single byte of new knowledge has actually been quite open for 
future interpretation, often difficult to spot and point to. Even though the 
novelty might be a new interpretation or insight, there might also be included 
other new information or ‘factoids, all of which become buried within an 
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extensive narration impenetrable for computers. Much of the information 
now being digital, there might be an opportunity to critically assess how we 
communicate our knowledge and are open to transformation within scientific 
publishing. Often, digitisation has meant diversification as well as convergence.? 
When we now bring data into the world of publication, there are immense pos- 
sibilities for opening the whole process, enhancing documentation and sharing 
knowledge in new ways. 

The historian has to decide upon how many of the sources can or should be 
linked to, in other words how many should be digitised, if the sources aren't 
digital and how digitisation should be achieved. Or perhaps the links are all 
external, linking to existing trustworthy digital sources? Data collection and 
creation is more complex in digital humanities than in traditional humani- 
ties research, since questions of documenting provenance and deciding on 
data and metadata formats will affect the research in profound ways. There are 
some cases where established standards exist, like TEI (xml format by the Text 
Encoding Initiative) for encoding text. But TEI in itself will not solve problems 
of interoperability on a deeper semantic level. It would, for instance, always be 
advisable to use good external references as identifiers for all concepts whenever 
possible. Also, the plan for publication might set limits to what the research- 
ers should do, since the platform they choose might have some bearing on the 
formats, metadata and granularity of the publication. If the researchers use 
other people's digital resources (OPEDAS or Other People’s Existing Data and 
Services, as named by a leading FAIR data expert Barend Mons”), they obvi- 
ously need to find out extensive information about them, not only the technical 
and historical provenance, but also about how the data is structured and coded. 
Often, a historian uses OPEDAS created not by researchers, but by heritage 
institutions. As the use context changes, the data provider institutions generally 
do not have readymade generic solutions for managing and publishing research 
data, especially when it is produced by outsiders. 

One of the unfortunate traits of the traditional data life cycle model is that 
publication turns up as a distinct step in a specific and late stage of the pro- 
cess. This hides the fact that the most efficient and impactful way of doing 
research might be doing it transparently all the way. Since this both forces 
the researcher to implement some type of data management and opens up for 
collaboration and spotting quality issues at early stages, this can accelerate the 
work and enhance the quality of the research. After publishing raw versions of 
data, unforeseen help can turn up, when colleagues become aware of what the 
researcher is doing. Close collaborations have not always been an option in 
historical research, which carries the heavy burden of romantic lonely genius 
syndrome, but luckily times are changing. Stealing other people's ideas and data 
is not the first thing most researchers think of. Rather, by publishing raw 
data, the researchers can get their work registered at an early stage, instead of 
waiting for the final peer review. Better collaborating and coordinating than 
working in silence. 
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Version control is the next crucial aspect of the data cycle. If you ask an archi- 
vist, they would probably want to save every version of everything. Even worse, 
this might mean not just saving the information you need to recreate the needed 
version of a dataset, but saving complete copies of each version, independent 
of all redundancy that would create. Version control is generally not that well 
developed in traditional archives. However, every version that is published 
needs metadata and, preferably, a persistent identifier. But this does not mean 
that the researchers have to save everything, every single byte. The researchers 
simply have to be sure that the dataset can be presented in an exactly identical 
form when asked for at a later point in time. In case somebody made a citation 
or important conclusion based on it, it should be possible to reconstruct what 
has happened. It is very important to be clear about it, if this is something the 
researchers do not commit to, when they publish data. 

Managing research data is not the same thing as archiving it, and handling 
digital data requires a somewhat different approach. Here, storage and data 
management are relevant components building trustworthiness of the docu- 
mentation. Citation is one of the main functions of persistent identifiers in 
research. The researchers should be mindful creating them though, since every 
persistent identifier is a commitment to manage the dataset or at least its meta- 
data forever. It will cost somebody a substantial amount of effort and work. 
And even if the dataset is deleted, a tombstone page should be maintained. 
Here, the well-managed research infrastructures and data services come into 
the picture as essential supporters of research. 

Generally, one could consider there to be three different types of datasets 
that are relevant for historians (see Figure 5.2). First, there is the master data 
produced and often published by government institutions, like the cultural her- 
itage data. Unfortunately, it is not always well versioned or documented (red 
in Figure 5.2). It could be data of any kind for any use, but it might be relevant 
for a historical research question due to a long time series or for some other 
reason. Second, there are generic research datasets, which are produced by 
researchers for scientific use (green). Here, you find datasets like corpuses or 
some of the surveys published by the Finnish data archive. Much data of this 
kind can also be found, for instance, with the National Institute of Health or 
other domain-specific research institutes or government bodies. These datasets 
are validated and often cumulative. The third type of research data is a research 
output, created to underpin a specific study or article (blue). These data need to 
be saved, albeit the interest for reuse might be minute, for the simple reasons of 
reproducibility of the research and merit for the creator. 

The historian often finds her digital sources within the first or second cat- 
egory of data. But as she proceeds with her work, the question of publishing 
second- or third-type data becomes increasingly pressing. Now, there is no 
single clear path to publishing this kind of data, which is often a derivate of 
cultural heritage data. Additionally, researchers within the humanities many 
times deal with sensitive data or data under copyright, which makes storing 
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Figure 5.2: Main types of data used by historians and how those are interre- 
lated. Source: Author. 


and publishing even more difficult. I will discuss the options later in this chap- 
ter when discussing research infrastructures. 

There is often more to a digital humanities research outcome than just a 
dataset and a result explained in a narrative form. It needs to be pointed out 
that the historian often handles a double narration: one of the research process 
and then another, which is the actual new knowledge. This is the normal 
situation when carrying out gualitative research or being unable to present or 
refer to the actual tools and methods used. However, when using computers 
and computational methods, the process and outcomes like dynamic databases 
or visualisations could and should also be included in the outputs, in addition 
to information about the sources or actual data. For this, the existing solutions 
are few and the methodology is very thin. Preservation of databases has devel- 
oped somewhat, but documentation and preservation of dynamic user inter- 
faces and other kinds of complex code is still in its cradle. It is well known that 
they need an extensive amount of curation to be kept usable for more than 
some years. This means that they are both risky and costly to preserve. Still, 
some effort to save these is better than just abandoning digital projects at the 
end of project funding. The problem is usually to find the party willing to take 
the responsibility. Therefore, this is also one thing that best would be solved 
at the point of planning the research. 
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Source Criticism and Research Assessment 


Assessing digital sources requires a substantial amount of metadata. I need 
to discuss this theme more closely to explain why and how data management 
planning and infrastructures are relevant not only for creating FAIR data, but 
also for carrying out high quality research in history in a digital environment. 

A digital document does not have an ‘original’ copy. Instead, it is recreated 
every time the source is rendered from a digital file consisting of zeroes and 
ones. Everything is just copy, while the analogue versions, which are the ones 
we can perceive with our senses on the screen or in our ears, are generated by 
software and hardware that have a decisive effect on what they actually appre- 
hend. The calibration of the screen or the sampling frequency of an audio file 
might affect how one interprets what is represented or real. In cases where a 
physical original exists, one can always check it, but if the source is born-digital, 
this becomes impossible. Therefore, there is a need for technical metadata. 

The best way to evaluate the trustworthiness of a digital source, as is com- 
monplace for the historian, is to check its provenance. In practice, the research- 
ers need to assess the organisation or person who has delivered the source. 
Can they show documentation about the technical and administrative life cycle 
of the source? Do they comply with the Open Archival Information System 
(OAIS) standard or do they have other certificates for digital preservation?!’ Do 
they use and manage persistent identifiers that are globally unique and persis- 
tent? Can they present extensive metadata, including checksums? The check- 
sums are important digital seals for calculating the integrity of files, but they do 
not work across file formats, which is why the researchers need to have a good 
trail of documentation and management of persistent identifiers. The formats 
might have changed during the life cycle of the data. What else has happened in 
terms of migrations, curation, cleaning and enhancing the data? Is everything 
convincingly documented? 

There are several kinds of metadata. To be able to represent a digital source 
in a similar or corresponding way we need technical and structural metadata 
that helps one choose the right tools and understand possible offset. We also 
need administrative metadata that informs about the rights and responsibilities 
attached to the data. Furthermore, we need descriptive metadata, which helps 
with finding and organising the data, as well as with the usual historical source 
criticism around what, who, when, why and other contextual information. 
This is the part of information that is most threatened in research data, due to 
reasons of personal data. Data archives often prefer anonymised data, which 
means crucial historical information is permanently lost from the historian’s 
point of view. This is also the reason why the current research data archives do 
not provide sufficient services for many historians. The organisations that 
do this best are institutions like the Swedish and Finnish literary societies, 
which have a profound understanding of the importance of the personal and 
unique as part of the greater whole and of the research processes within cultural 
studies and history. 
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Itis important to understand the ephemeral nature of digital information, not 
only when it comes to the historian's own sources, but also concerning working 
with data. If research is to be possible to repeat, the digital operations under- 
taken should be well documented. Code should be documented and saved and 
versions of the dataset have to be managed. Not everything has to be saved, but 
one should consider versioning and documentation when significant changes 
are made." Conversions, cleaning and mapping need to be accounted for, since 
they may affect the outcome of the research. And as technologies become obso- 
lete over time, all types of metadata are necessary. Otherwise preservation will 
not be possible." This part ofthe data management should be planned together 
with data librarians and professional data stewards. 


Infrastructure and Services 


Reliable and good quality research craves good citations and linking. The 
historian's digital sources can be found in cultural heritage institutions or in 
many other places that sometimes, but not always, offer possibilities to create 
FAIR data by pointing to the sources in exact and sustainable ways. Often, 
the researcher needs to clean and organise the data, which in turn creates a 
new dataset. 

According to the European Commission, research infrastructures (RIs) 
are facilities, resources and services used by the science community to 
conduct research and foster innovation. The Finnish Academy is lengthier in 
its definition: 


Research infrastructures refer to a reserve of instruments, equipment, 
information networks, databases, materials and services enabling 
research at various stages. Research infrastructures may be based at 
a single location (single-sited), scattered across several sites (distrib- 
uted), or provided via a virtual platform (virtual). They can also form 
mutually complementary wholes and networks. Europe hosts several 
large-scale research infrastructures that are open to collaborative use 
across national boundaries. 


The Open Science and Research Initiative report addressed RIs." This report 
distinguished between services, data and equipment. This classification has 
also been implemented in the national Research Infrastructure catalogue, 
which provides persistent identifiers for these (https://research.fi/).'^ Many 
infrastructures provide two or three of these types of resources. The national 
strategy for RIs" demonstrates that we have advanced infrastructures for lin- 
guistics, register research and social sciences. The national consortium for sup- 
plying digital publications for the research libraries within all domains is also, 
for some reason, considered a humanities and social sciences infrastructure. 
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The cultural heritage sector is left out, except for the shared search portal Finna, 
which serves the public as well as the research community at large when it 
comes to traditional research publications (namely, articles and monographs). 
This means the search portal aggregates some relevant research data for his- 
torians, like individual photographs or archival collections, research dataset 
metadata from the Data archive (a patchwork with very few internal links and 
of highly varied granularity) and research literature from all fields. The Euro- 
pean cultural heritage portal Europeana does the same, except leaving out the 
literature and focusing on the traditional but digitised sources. 

The main problem is, besides missing sufficient persistent identifier man- 
agement, the lacking information structures. The digital objects vary in size 
from single photographs to archival collections and corpuses with almost non- 
existing descriptive metadata from a historian’s point of view. Saying this, I do 
not want to belittle the enormous and important work that has been done to 
bring all this metadata together. It has been an extremely valuable effort, with 
thorough implications for the cultural heritage sector in Finland, which has 
taken huge steps towards openness and digitisation. However, for research 
we still require better representations of the sources and their internal rela- 
tions. Important digital sources are omitted, including databases provided by 
the institutions themselves, not to mention historical research databases else- 
where, whose producers often face great difficulties getting hold of sustainable 
funding or sufficient data management for their digital research outputs. The 
cataloguing of these resources, documentation and linking datasets derived 
from cultural heritage data in general is today left to the researcher, who gener- 
ally has few possibilities to maintain these after the funding ends. Today, the 
historian most often has to be content with publishing discrete research data- 
sets as simple files, which have weak and only human-readable links to other 
digital resources. Also, the reuse value is less than it probably would have to be, 
due to this approach and meager machine-readable semantics. 

Both the Language Bank of Finland and the data archive have juridical man- 
dates to store this kind of data, but the researcher has an extensive responsibil- 
ity too. The slightest flaw in consents or rights questions easily becomes an 
insurmountable hurdle for archiving or sharing the data. There are also reasons 
to question whether this kind of publishing is the one and only, or whether 
there could be more suitable platforms or structures than the currently avail- 
able solutions. 

Digital media are not only unstable and diverse, but they are also often more 
disposed for interactivity and a dynamic communication that happens in dia- 
logue, even co-creation with the readers/users.'* In fact, it might be a mistake 
not to consider this kind of publishing and knowledge creation in a research 
domain that is so relevant and open to popularisation and popular culture. Dif- 
ferent kinds of map and wiki applications can be used for sharing historical 
knowledge. Wikis are especially suitable due to their very transparent and clear 
version management. They also enable very good structuring and linking of 


100 Digital Histories 


data.” In fact, the wiki technology combined with careful data management 
would offer an almost out-of-the-box solution for FAIR data. 

The historian needs to carefully plan her data management. Questions of 
personal data, consent and copyright need to be addressed at an early stage 
before even starting the research. This does not mean that one has to decide 
on every detail or stick to the plan whatever happens. In fact, the opposite is 
often true: the plans have to be modified or redone, when new issues arise. 
The research process in digital humanities is often iterative, oscillating between 
qualitative and quantitative methods, and research questions sometimes have 
to be adjusted or revised. 

From the very beginning, it is important to plan for managing data files, 
backups and versions. Also consider the types of data that will be included 
and analyse the need for documentation needed for citations and reproduc- 
ibility. It is not necessarily a good idea to get a resolvable persistent identifier 
for every single data object. Instead, one should be pragmatic and consider 
the dataset as a part of the surrounding information universe and try to create 
meaningful, machine-readable and sustainable relations to that universe. Do 
not produce new data objects where you can reliably point to external ones. 
Also, one should be mindful about the granularity: Which are meaningful enti- 
ties to make findable and for which to create metadata? 

When it comes to infrastructures, we have to operate with what we have got, 
but historians could also give valuable input in creating a meaningful larger 
network of digital historical knowledge by engaging even more in questions of 
common or interoperable infrastructures. There are large infrastructure initia- 
tives like DARIAH-EU, CLARIN-ERIC, Europeana and the European Open 
Science Cloud (EOSC), but there is still not a suitable solution that would serve 
historians well in publishing and linking their research outputs. It is essential 
that historians discuss these questions with other stakeholders, the cultural 
heritage institutions, the scientific libraries and their own research institutions 
and funders to find sustainable solutions and drive infrastructure development 
in directions that serve knowledge creation, not only as separate projects, but 
as a linked network of information. 


Notes 


' Antonijevic & Stern Cahoy 2018. 

? FORCE11; Wilkinson et al. 2016. 

? Not invented here 2018. 

* Data science report 2016. 

5 Klein et al. 2014; Jones et al. 2016. 

* Finnish Committee for Research Data 2018; Research Data Alliance 2015. 
? Parland-von Essen 2017; see also openscience.fi. 

* See Finna.fi, kdk.fi and digitalpreservation.fi. 
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? Anderson 2007; Manovich 2013. 

10 See Mons 2018. 

! See, e.g., the DCP online guide on OAIS: Lavoie 2014 and the standard ISO 
16363:2012. 

? Language Bank of Finland. 

? PREMIS preservation metadata. 

14 Academy of Finland 2018b. 

5 Avoimuuden politiikat tutkimusinfrastruktuureissa: Selvitys 2015. 

16 RIs, https://research.fi/. 

17 Academy of Finland 2018a. 

18 Salgado 2009; Nygren 2013; Marttila 2018; Viinikkala et al. 2016. 

1 See, e.g., Wikisources, Wikimedia, Wikidata and Tieteen termipankki. See 
also Wikidocumentaries and Wikimaps. 
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CHAPTER 6 


Big Data, Bad Metadata 


A Methodological Note on the Importance of 
Good Metadata in the Age of Digital History 


Kimmo Elo 


Introduction 


During the past decade, digital humanities has emerged as a new paradigm 
seeking to gather scholars interested in applying computational methods on 
their research materials. This development has been supported by the almost 
exponential growth of either born-digital or digitised materials currently avail- 
able for researchers. Further, the availability of computational research tools is 
much better today than, say, five or 10 years ago. 

New terminology like big data, data mining and text mining well illustrate 
the massive growth of digital data available for research purposes. At the same 
time, the digital research agenda is filled with huge expectations regarding 
exploratory research, the growth of scientific and societal knowledge or new 
forms of data analysis. Some scholars have rather strong expectations about 
how digital humanities should change our whole understanding of knowledge 
and how knowledge is presented.' 

This chapter supports the general understanding of digital humanities as 
an important, computational field of research for the Humanities and social 
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sciences in general, and for historical research in particular. The chapter stems 
from the deep conviction of a scholar rooted in the intersection of computa- 
tional, historical and social scientific research that exploring digitised historical 
sources could help us to gain new insights and improve our understanding of 
the past. 

At the same time, however, this chapter is motivated by my worry that, as 
regards historical research, thus far much attention has been paid to the crea- 
tion of digital research material, but too little has been to paid to the creation 
of research data. To clarify my point, with research material, I refer to origi- 
nal, primary sources like documents, letters, photographs, etc. With research 
data, I refer to corpora consisting of both the original material and additional, 
descriptive information derived from the original material. To put it bluntly, 
we are almost over-flooded by the former, but there still is no shared or com- 
mon strategy about how to cope with the latter. The importance of the latter is, 
however, reflected by the fact that many universities are developing research 
data management practices.? 

The main thesis of this chapter is that more attention should be paid and more 
resources should be invested in metadata creation. The next section introduces 
the very concept of metadata and tackles the question of why metadata mat- 
ters. The second section presents arguments about why metadata should be 
considered as an important part of digitising projects. The chapter is rounded 
up with concluding remarks related to the future work in digital history. 


What Is Metadata and Why Do We Need It? 


Due to the limited space available for this chapter, I refrain from a literature 
review and just point out some of the most important aspects related to meta- 
data and discussed (mostly) by librarians or archivists. Metadata is widely 
understood and defined as ‘data about data and, thus, is expected to provide 
information about the content of the material it is linked with. In other words, 
metadata should summarise the most important content. According to The 
metadata handbook, metadata should be constructed in a way which 'fully sup- 
ports findability and discovery? 

Accordingto Allen Benson, metadata is a descriptive model, a summary report 
to present the main content according to a formalised structure consisting of 
information-bearing entities.’ Richard Pearce- Moses defines metadata creation 
as the ‘process of creating a finding aid or other access tools that allow individu- 
als to browse a surrogate of the collection to facilitate access and that improve 
security by creating a record of the collection and by minimizing the amount 
of handling of the original materials? Hence, metadata is an ontological model 
providing a structure for information arrangement. At the same time, metadata 
creation is a descriptive process aiming at filling in the ontological model with 
material-related, descriptive information. 
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I am quite convinced that the ontological side is not the core problem. Sev- 
eral well-developed models exists as to how metadata should be structured or 
what descriptive elements are available in order to guarantee a standardised, 
formalised metadata. Further, as regards born-digital materials specialists 
have been discussing from the late 1990s onwards how this development affects 
ontological requirements for the metadata." 

Hence, the real problem is the metadata creation process, especially when 
this process must be started from scratch and/or with only limited previous 
knowledge about the full content of the material to be modelled and summa- 
rised into metadata. Although the metadata should fulfil a relatively straight- 
forward task (namely, support findability and discovery), at least three main 
pitfalls should be taken seriously. 

First, what or who determines the elements included in the metadata struc- 
ture? The answer to this question widely determines the content described and 
formalised in the metadata. At the same time, however, it has a strong impact 
on both findability and discovery, since metadata queries are limited to the 
fields used in the model. A more complicated issue relates to hierarchies or 
sub-categories typical for historical sources (for example, 'building'- house 
or ‘building’ - church). Two examples should clarify the point. Let us first con- 
sider a novel. A standard metadata includes the author(s), the title, the pub- 
lisher, the year of publication, the genre and a few keywords used to summarise 
the main content. In most cases, these elements suit well the needs of a reader 
looking for certain novels. But how about a researcher looking, for example, 
for novels with a certain type of protagonist or a certain person/figure? Or, 
second, a photograph collection. Once again, many elements to be included in 
the metadata are quite straightforward and obvious (timestamp, photographer, 
title), but how about persons, places or abstract elements like gestures, memes 
or visual effects? The answers depend on the supposed group of end-users and, 
thus, make the material unusable or unfindable for certain groups. 

Second, what or who determines the terminology (for example, keywords, 
descriptions) used to describe content? Once we have determined what content 
should be summarised in the metadata, we need to determine how different 
content-related aspects are described. Once again, standardised dictionaries, 
keyword indices, etc. exist, so there is rarely a need to reinvent the wheel by 
creating a new vocabulary. The challenge is to maintain coherence; that is, 
to ensure that the same (or similar) content is described in the same terms. To 
use a simple example, if there are bunches of photographs all having different 
kinds of buildings in them, all of these photographs should be found if one 
searches for 'building. But should the end-user be able to find buildings of the 
type ‘church’ as well? Once again, findability should guide the process of meta- 
data creation. 

And, third, who creates and maintains the metadata? Prior to the digital 
era, collection management and metadata creation have been almost solely in 
the hands of librarians and archivists, especially when it came to the creation 


106 Digital Histories 


and maintenance of large document collections.’ Today, many collections are 
created, maintained and made available by private organisations, institutions 
and companies. This is partly due to the limited resources of public institu- 
tions like state archives or libraries, but also thanks to the reduced costs of 
digitisation, the increase of easy-to-use solutions for data management and 
hosting, and to the growth of data-sharing platforms like cloud-based ser- 
vices. The other side of the coin is that a majority of these platforms is rather 
weak and underdeveloped in metadata creation and maintenance, especially as 
regards the content description. One solution enjoying growing popularity is 
‘crowdsourcing’ a process where ‘ordinary people’ help the maintainers to cre- 
ate descriptive metadata. There are many examples ranging from ‘tagging’ over 
‘person identification’ to ‘linked data creation, all of them producing interest- 
ing and promising results, but also highlighting many problems mostly related 
to the heterogeneous quality of the resulting metadata and difficulties in ensur- 
ing the correctness of input.’ 


Why Digital History Should Take Metadata Seriously 


A quick survey in recent literature around digital history reveals that questions 
related to metadata creation have rarely been debated among digital historians. 
Instead, historians seem to be educated to use metadata when searching for 
sources, not to question the metadata itself. In other words, we are used to rely- 
ing on metadata created by archivists or librarians." This was a good practice 
in the times when collections were mainly and dominantly housed by libraries 
and archives. 

The digital era has already changed this division of labour, and there is no 
evidence whatsoever that this would change in the future. Quite the contrary, 
billions of gigabytes of born-digital textual and visual materials are produced 
and made available without any, or with only weak and incomplete, metadata. 
However, without a proper metadata, materials 'are simply a meaningless col- 
lection of files, values and characters:!! And as Edelstein and colleagues point 
out: 'Historians increasingly find themselves utilizing digital databases as the 
idea of the searchable document and the virtual archive reorganize how librar- 
ies, research institutes, teams of scholars, and even individual researchers pre- 
sent and share interesting sources?” 

Quite much effort, money and time have been invested in the digitising 
of historical textual materials like manuscripts, documents, letters, etc. As a 
result, historians have access to a vast number of digitised text and can view 
and query digitised indexed document collections and editions online. One of 
the most prominent examples is the ‘Republic of Letters’ project, focusing on 
historical networks of correspondence between scholars from all around the 
world.? Another similar project is the Letters of 1916 Digital Edition’ project, 
one ofthe first crowdsourced humanities projects, as well as histoGraph, which 
also uses crowdsourcing for metadata creation." 
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In their evaluation of the ‘Letters of 1916’ project, the authors note that '[t]he 
meaning of the term “metadata” was unclear for most participants} This seems 
to be linked to a wider aspect, namely that '[m]uch attention in the past fifteen 
years has been directed toward text digitization," forcing ‘scholars to access 
historical sources in a new way: through specific words.” As a result, most 
digitised collections available online are ‘focused on searching, not browsing. 
Hence, findability might be good (thanks to the power of full text search in 
digitised text documents), whereas discovery might be poor. 

Modern text mining methods can be of help when historians are dealing with 
digitised textual corpora. Further, computational methods like (semi-) auto- 
mated document classification or indexing can make the metadata creation 
process easier and more effective. However, the current tendency to make old 
documents available as PDF collections worsens the situation. The positive 
thing in using the so-called layered PDF format is that end-users can see the 
original document, but also use search and copying functionalities through 
the text layer. The negative side is that in most cases the text layer is an exact, 
character-based reconstruction of the page (mostly based on the corrected 
results from the optical character recognition (OCR) process), not a raw text 
laid out and paginated according to the original design. As a result, hyphened 
words, to give an example, on two lines are not understood as one, but as 
two separate words (of which the first ends with a hyphen!). My reader can 
imagine what kinds of limitations result from this kind of practice for document 
discovery, even if the research interface offers expanded search capabilities 
like regular expressions. This is because most search engines are based on pat- 
tern matching, whereas, for example, irregularly split words do not have a dis- 
tinct pattern. 

Another growing challenge is that sources relevant for historians and social 
scientists include not only textual collections, but also visual or audio materials 
like photographs, music, films and so on. Although the question of metadata 
creation is relevant for all digitised collections, the real challenge relates to 
non-textual materials. Since the share of information delivered in non-textual, 
mostly visual forms is steadily growing, the problem of findability and discov- 
ery of such materials is of increasing relevance also for historians. There exists 
already vast collections of such materials, but at the same time our tools to 
directly query visual or audio materials are very limited, yet slowly improving.” 
For example, many digitised historical photographs include non-recognised 
persons or places, but the problem is also relevant for today. According to 
de Figueirédo and Feitosa '[a]pproximately 350 million photos are added 
to Facebook each day[, but most of them] are not annotated? The problem 
here is not just about forgetting, but also about findability and discovery. Non- 
annotated photographs cannot be queried, and they do not appear in search 
results, even if their content was relevant for the query. How are we expected to 
find, for example, photographs with ‘Konrad Adeanuer' on them if we lack both 
techniques to identify (that is, to name) persons behind recognised faces and 
metadata containing information about persons shown on the photographs? 
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Many recent articles point out that digitised collections and online resources 
affect the way in which scholars discover and access historical sources. Instead 
of selecting research material from the sources by close reading, research mate- 
rial is increasingly selected by using search engines or by applying methods of 
computer-aided, distant reading. Two biasing consequences seem worth being 
noted. First, the use of search engines and other online resources might influ- 
ence and steer scholars to favour materials available online and, consciously 
or unconsciously, to change their research questions to suit digitally avail- 
able materials. Second, scholars might not be aware of missing or incomplete 
metadata possibly affecting and limiting research results. This second aspect is 
especially relevant for non-textual material collection, but has at least some rel- 
evance also in regard to textual data offered as simple, non-indexed PDF docu- 
ment collections. Another problem is that many collections do not provide any 
information about the completeness (or better: incompleteness) of their data. 


Discussion 


This chapter has tackled the question of the relevance of metadata for histori- 
cal research. Metadata is understood as ‘data about data, an ontological model 
summarising the main content of the data. The very idea of metadata is to 
make the source material findable and discoverable. In the current digital era 
characterised by the exponential growth of digitised materials and the avail- 
ability of vast online resources, both goal-settings gain in importance also for 
historical research. 

Based on the arguments presented above, I conclude that metadata is 
extremely relevant also for historians. On the one hand, historians increas- 
ingly use and explore online resources like historical document collections 
or photograph corpora. Most of these online portals offer search engines or 
other possibilities to query the collections. Instead of selecting material by the 
process of reading the material document by document, material selection is 
increasingly based on search results. Since there is no reason to believe that this 
will change in the future, historians should be interested in ensuring that all 
relevant aspects are searchable, findable and discoverable. 

On the other hand, the whole collection management is in flux, as digit- 
ised collections are made available by a wide variety of actors. If there exist no 
standards for quality management of data collection, how can findability and 
discovery be guaranteed? Once again, the ontological side is not the problem. 
The problem is the process of creating annotations and metadata. 

A third aspect should be added to the two points above. Historical digitisa- 
tion projects often deal with materials of which only trained historians possess 
knowledge. With all respect to librarians and archivists, we cannot expect them 
to have an in-depth knowledge of historical persons, events or eras. Despite 
this, these two groups are still in charge when national, governmental and offi- 
cial collections are digitised and annotated with metadata. 
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Although there is no easy patent solution regarding how to ensure metadata 
quality for historical collections, historians should be encouraged to engage 
in digitisation projects in their own fields of expertise. As Reilly point out, 
libraries, but also archives, ‘must ensure that they maximize the visibility of 
their collections—not just to the general public but to those in the education 
system.” In this respect, historians should engage as mediators between the 
research community and libraries and archives. 

Historians value original documents and are trained to source criticism 
and to work in archives. At the same time, they are quite reliable on what is 
involved in the quality of collection management and hosting in archives, and 
many archivists and librarians enjoy a high respect for their expertise. A good 
archivist can fill the gaps in a researcher' inquiry and, thus, find relevant and 
reliable sources. 

The shift from this human-to-human interface towards a human-to-computer 
interface replaces the ‘silent knowledge’ of an archivist with algorithms run 
by the computer. The search process itself might be more effective and quicker, 
but the other side of the coin is that the user has only limited possibilities to 
explain her intentions. As pointed out above, a scholar is forced to figure out 
correct terms and words for his query, but still he cannot be sure whether he 
receives all (or even the most) relevant materials. 

To round up my argument: it is by far not sufficient to digitise original sources 
if we cannot ensure findability and discovery. Digitised original sources must 
be processed into research data consisting of the original content plus descrip- 
tive metadata summarising the essential content of the material. Metadata crea- 
tion should not be disparaged, nor should it be seen as a quick, dirty task to 
be completed as soon as and as inexpensively as possible. Research data is the 
most valuable content of a vast material collection, since it enables both find- 
ability and discovery. If scholars cannot rely on getting reliable results when 
committing searches in online collections, the digital leap manifested by pro- 
ponents of digital humanities might end with a belly flop. 


Notes 


! See, e.g., Burdick et al. 2012. 

> See, e.g., https://www.helsinki.fi/en/research/research-environment/research 
-data-management. 

> Register & McIlroy 2015. 

* Benson 2009: 161-162. 

5 Pearce-Moses 2005: 112-113. 

6 Benson 2009; Gonzales 2014; Valentino 2017. 

7 Langdon 2016. 

* Edelstein 2017: 401. 

? See, e.g., Stvilia 2009; Reilly 2012; Stvilia 2012; Turin 2015; Valentino 2017; 
Wusteman 2017. 
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CHAPTER 7 


All the Work that Makes It Work 
Digital Methods and Manual Labour 
Johan Jarlbrink 


Automation is a temptation and a promise, and perhaps a threat. Old jobs dis- 
appear as robots and software do what human workers used to. Is this also 
the case with research within the humanities? Computers can process datasets 
of texts so large that it would take several lifetimes for scholars just to read 
it. Computers are excellent in finding patterns that are hard to recognise for 
human eyes and brains. What should researchers do when computers are much 
better in doing what scholars used to? 

In this chapter, I will argue that digital research is far from automatised.! 
A human being is still needed to make sense of results, of course. I will focus 
on something else, not on the creative ways in which scholars interpret 
data outputs, but on the dull tasks that make data outputs possible. Most data- 
sets need cleaning, editing and error checking. The outcome of automatic 
processes needs to be examined by someone who goes through the results; 
sometimes it needs to be corrected manually. Such procedures are often left out 
completely or only mentioned in brief when digital methods are discussed. Yet, 
they have a significant impact on results and need to be taken seriously. 

I will mainly focus on various forms of text analysis, based on my own expe- 
riences and what colleagues have told me, as well as cases described in the 
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literature. The cases are meant to shed light on an often neglected part of digi- 
tal methodologies, but the mundane aspects of data cleaning and curation are 
also significant beyond the field of digital humanities. Such procedures can be 
understood as ‘a crucial part of the materiality of how scholarly and scientific 
work is done? The manual work needed to feed, improve and evaluate digital 
processing belongs to a long history of little tools and (supposedly) insignificant 
back-end operations that have made different kinds of research output possible. 
Digital scholarship, as traditional archival research and experimental work in 
laboratories, involves material and conceptual actors as well as human ones? 

In the first section, I will give a short background and explain why I think 
manual digital work matters. Three empirical sections will exemplify various 
kinds of manual operations. First, I describe human-assisted computational 
analysis in the humanities in the 1950s, 1960s and 1970s. Second, I present my 
own experiences from a project based on 19th-century newspapers. Third, I 
tell the story of how a colleague of mine used digital Named Entity Recognition 
(NER) in combination with pen and paper. 


Invisible Work 


Glimpses of the manual work that makes digitisation and computational analy- 
sis possible are sometimes given by accident. Google Books preserves a large 
part ofour printed cultural heritage in a digital form, but also some of the hands 
that were needed to operate the scanners and handle the printed volumes. Just 
as the secretaries of the early 20th century, who left traces of themselves in the 
typewritten texts only as a result of errors, accidents make Google employees 
become visible in the digital database. Index fingers covered in condom-like 
pink gloves are included in many ofthe images now available online. They serve 
as a reminder of the people and work that feed the digital infrastructures.* Part 
of the workforce digitising printed materials is less visible. Much of the post- 
processing needed to produce useful digital surrogates is being outsourced to 
companies hiring low-wage workers in Cambodia and India. 

This kind of hidden work makes digitisation seem more straightforward and 
automatised than it is. The same goes for various forms of computer-assisted 
analysis. Tamraparni Dasu and Theodore Johnson has stated that: 


In our experience, the tasks of exploratory data mining and data clean- 
ing constitute 8096 of the effort that determines 8096 of the value of the 
ultimate data mining results. Data mining books ... provide a great 
amount of detail about the analytical process and advanced data mining 
techniques. However they assume that the data has already been gath- 
ered, cleaned, explored, and understood.* 


Much of the cleaning can be done with software. Even an easy-to-use program 
such as Microsoft Excel allows you to search and replace, filter, merge, separate 
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and delete different kinds of data. More advanced or custom-made tools allow 
you to fine-tune the process. Still, such procedures need to be monitored in 
order to ensure the quality of the outcome. Sometimes software fail, and some- 
times they need assistance from human pattern recognition. With a limited 
dataset, it can be more efficient to correct and edit by hand instead of spending 
time finding and running a software that will require additional and manual 
error checking anyway. 

Algorithms solve problems according to specified rules. That is why they 
may be of limited use if a dataset is noisy and patterns are irregular. ‘Signals 
are always surrounded by noise, even to the extent that we cannot always 
decipher which is which? Hadley Wickham explains (alluding to Leo Tol- 
stoy) that ‘tidy datasets are all alike but every messy dataset is messy in its own 
way.’ A dataset can be corrupt in numerous ways, but there is only one way in 
which it is flawless. The multiple possibilities of errors, and the irregularity of 
their occurrence, can make it difficult to specify the rules on how to solve prob- 
lems algorithmically. In some cases, the fastest way may be to do some of the 
work manually. 

As Dasu and Johnson point out, cleaning has a significant impact on results. 
Yet, detailed discussions on cleaning and error-checking processes are rare in 
introductions and chapters on methodology. Introductions usually describe 
digital tools, not manual or semi-manual tasks.’ The role of digital tools and 
models is often discussed in terms of black boxes, with an input and an output 
and an obscure software in the middle. Such black boxes must be opened up 
in order to make research processes transparent. Manual and semi-manual 
procedures can be said to represent another black box, however, perhaps even 
more opaque. They can be difficult to describe in a transparent way since they 
rely on human pattern recognition, a sensitivity to individual cases and the 
ability to make informed distinctions between information and noise. 


A History of Manual Labour 


As Markus Krajewski has pointed out in his media history of service, before 
digital servers there were human servants: human calculators, research assis- 
tants and secretaries.! The birth of automatised data processing did not do 
away with them. When Vannevar Bush speculated on the future research 
potentials of computers in 1945, he described a machine that *will take instruc- 
tions and data from a roomful of girls armed with simple keyboard punches, 
and will deliver sheets of computed results every few minutes?! Father Robert 
Busa is often referred to as the first scholar to use the capabilities of computers 
within the humanities. However, his project also involved 'a roomful of girls: 
His interest started in the 1940s when he studied the preposition 'in' in the 
works of Thomas Aquinas. This research would clearly benefit from the tech- 
nologies developed to speed up data processing in business and administration. 
Busa partnered with IBM and during the following decades they constructed 
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an index of the full vocabulary in the works of Aquinas, published in 1974 as 
Index Thomisticus. In words that echo in recent publications on distant read- 
ing, Busa stated that: “What had first appeared as merely intuition, can today 
be presented as an acquired fact: the punched card machines carry out all the 
material part of the work of (making a concordance)?? 

The process was far from automatic though. The mainframe computers avail- 
able at the time required ‘a constant procession of human servants." In 1964, 
Busa had a team of 60 people assisting him with editing, programming and 
machine operations. Around 35 staff members were required for key-punching 
texts, verifying, listing, sight-checking and punch-card processing (the data 
was later transferred to magnetic tapes). In 1951, he estimated that it would 
take four years to complete the index. The reason why the project was not fin- 
ished until the mid-1970s was mainly the laborious work of pre-editing and 
proofreading. ‘Busa calculated that the thirty years of work he and others had 
spent on it amounted to roughly one million man hours??? The foundational 
project of what would become digital humanities was truly a manifestation of 
the manual work needed to process data with computers. 

The labour-intensive process did not discourage other scholars from using 
computers in their research (perhaps because those who introduced new 
methods seldom emphasised the importance of manual tasks). When the Index 
Thomisticus was completed in 1974, Busa was no longer alone. Linguists were 
among the early adopters, as well as some historians. Swedish historians 
were introduced to the idea that ‘Clio faces automation’ in an article by Carl 
Góran Andre from 1966. Andre explained that modern computers provided 
solutions to problems related to massive source materials. With data coded 
onto punch cards, or optical and machine-readable paper forms, it was possi- 
ble to sort large amounts of data mechanically or electronically. In many cases, 
the systems were used as search engines, but they could also perform statistical 
calculations. The examples he gave included databases of coded newspaper 
articles, correlations between election results and census data, and the geo- 
graphical distribution of unions and memberships in popular movements. 
Andre concluded, as Busa before him, that: "Ihe mechanical work can now be 
left to computers." 

Details on the actual research process are rare in publications by the first 
generation of computer-using Swedish historians." Assistants, secretaries and 
machine operators may have been essential parts of the research process, but 
they were rarely acknowledged in the end results. Some clues can be found, 
however, and the impression they give is quite different from Andras opti- 
mistic view. The most laborious tasks concerned coding, in this case referring 
to the transfer of data from source documents to machine-readable formats 
(punch cards or optical markings on paper forms). A Swedish pioneer, the 
press historian Stig Hadenius, explained in 1967 that it took ‘not more than 16 
people’ to extract the data needed for a pilot study on political news between 
1896 and 1908. A large project on Sweden during the Second World War had 
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a group of researchers investigating newspaper debates during the war. In order 
to render the newspaper material searchable, they coded 165,000 articles to 
create an index based on punch cards. The research team manually coded 138 
variables for every article.? In the 1970s, a series of dissertations from Lund 
University used similar methods to process newspaper articles on various top- 
ics during the postwar era. Gunnel Rikardsson, who wrote about The Mid- 
dle East conflict in the Swedish press (1978), did not elaborate on the manual 
tasks, but explained that six people had been involved in the process and that 
the ‘coding work was experienced as exacting, mainly due to the high degree 
of concentration needed. When the newspaper data was finally coded, 
however, the computer took over the workload: ‘Manual processing had not 
been possible.” 

In his article from 1966, Andre speculated on future research possibilities. 
Governmental agencies were already using computers to store and process 
data. Thus, for future historians who wanted to analyse the data, computer 
skills would be an absolute necessity. Most of the sources that historians worked 
with in the 1960s and 1970s were not ‘born digital though. The technologies 
(such as Optical Character Recognition, OCR) transferring analogue data to 
digital media showed promising results, but the majority of the research pro- 
jects relied on manual labour. Millions of hours were spent on manual coding, 
punching and proofreading. The name of the research centre founded by Busa 
in the early 1960s was Centro per LAutomazione dell'Analisi Letteraria. Yet, and 
contrary to the automation emphasised in the name, photographs from the 
centre show what was often left unnoticed when research output was presented: 
rows of human operators, most of them young women.” 


Struggling with Noisy Newspapers 


The manual tasks needed today are of a different kind. The digitisation of 
sources is part of many research projects, but with scanners and software for 
OCR the digitisation of printed texts can be more or less automatised. Even 
handwritten texts can to some degree be digitised with the help of OCR tech- 
nology. A significant difference, though, is that archives and libraries do much 
of this work for us. This is especially true for newspapers and books, parliamen- 
tary records and collections of audio-visual media, paintings and maps, and 
other museum artifacts. As long as the copyright allows for it, texts and images 
are made available online. In most cases, we do not need (and cannot afford) 
35 assistants transferring data from one medium to another. Full-text search, 
topic modelling and tools for text analysis often make it unnecessary to code 
individual texts manually. 

And yet, not all datasets are ready for processing out of the box; many of them 
can be very messy. As Carl Lagoze has pointed out, traditional archives and 
libraries used to guarantee the integrity of their records, at least in principle. 
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Control and curation were meant to facilitate the provenance and stability of 
data. The digitisation of collections and archival records has meant a fractur- 
ing of this control zone.” When millions and millions of pages are transferred 
(or translated) into digital formats, no one can guarantee the integrity of the 
data anymore. For large datasets of non-canonical texts in particular, libraries 
have spent less resources on curation, leaving researchers with much of the 
cleaning and preparation. Newspaper databases are notorious in this respect. 
Frequent OCR errors are well known, problems related to text segmentation 
less so, but both kinds of errors make it difficult to process the texts without 
manual interventions. 

In one of my projects, I wanted to analyse discursive patterns in newspaper 
reports about the electrical telegraph in mid-19th-century Sweden.” From the 
National Library of Sweden, I was able to download a complete dataset cover- 
ing one major newspaper from 1830 to 1862, about 10,000 pages. A systems 
developer helped me to penetrate the data (the first person who was asked 
refused to work with a dataset this noisy). Our first goal was to find every arti- 
cle containing the words ‘electrical’ and ‘telegraph’ (‘elektrisk and ‘telegraf’ in 
Swedish). Since we expected a high frequency of OCR errors, we used a Leven- 
shtein distance to identify corrupted versions of our keywords, allowing three 
characters to be added, replaced or missing. In this way, we got 489 different 
hits for ‘electrical and 4,017 for telegraph. This was all done with a few simple 
commands, and the result came quickly.” 

Not all of the hits had anything to do with the electrical telegraph though, 
and in order to filter out the false positives I had to go through the lists manu- 
ally. That ‘dialektisk and ‘apoplektisk’ referred to something else was easy to 
figure out, but what about ‘pelektriska’ and ‘elepris’? What about ‘tograf? ‘tfies- 
raf’ and ‘ttlefrnf’? Such combinations of characters can only be interpreted in 
the context of their appearances in the newspaper. In order to single out the 
proper keywords, I had to search the database and read the texts. It turned 
out that many of the incomprehensible words generated by the OCR actually 
referred to the electrical telegraph. My corpus would have been much smaller 
if I had not spent some time on this semi-manual step. 

With an edited list of keywords, it was possible to locate every 'textblock in 
the XML-files where ‘electrical’ and ‘telegraph’ co-occurred. A textblock is a 
unit of text identified as a coherent text by the text segmentation tool used in 
the digitisation process. However, nineteenth-century newspapers are difficult 
to process for the tool. The small print, the lack of headlines and the packed 
columns give few graphical clues on where one text finishes and another one 
starts. Human eyes can see it quite easily, while digital tools make several mis- 
takes. Many libraries send the auto-segmented newspaper pages to private 
firms with outsourced divisions in Eastern Europe, Cambodia and India. The 
job of the staff is to correct the segmentation where it has failed.” The National 
Library of Sweden have skipped this crucial step in the process, however. I 
had to do the job myself. 
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We soon discovered that the textblocks generated by the tool had little to 
do with the texts as they were printed in the paper. Short news items from the 
same column were regularly merged into one single textblock, and longer texts 
chopped up into shorter pieces. The only way to single out the texts I wanted 
was to read through the whole corpus of identified textblocks and delete the 
unrelated parts. I also deleted text lines and combinations too difficult to 
decipher, such as IPIApfos2kOS2viKfSbmNAT and Tilet4R12bin1dPRRmo- 
8botoFrfutmfsOMMFgpgFvf* I did not read the texts as carefully as I would 
have done if close reading was my main research method. But still, I had to 
read them. 

With a somewhat clean dataset we could finally start to explore what the 
texts had to say about the electrical telegraph. We used a fairly simple and 
transparent method to identify semantic patterns. We looked at words co- 
occurring in a sliding window, and used the network analysis tool Gephi to 
find clusters of frequently co-occurring words. We still had some problems 
with noise though. Our method identified co-occurring words no matter the 
quality of the OCR, but for the final analysis we wanted to merge corrupted 
versions with the uncorrupted (for example, 'oeanen' and ‘oceaner (the ocean), 
‘Mo«se’ and Morse). Once again, we used a Levenshtein distance to pick 
out the most likely candidates to be merged, but I went through the lists to 
confirm the results manually. 

In the end, we came up with some new and fascinating results. Many of the 
ideas we frequently associate with the electrical telegraph were more or less 
absent in the newspaper reports. Very few mentioned anything about the uto- 
pian potential of the new medium, it was not seen as an immaterial way of 
communicating and the idea that it ‘freed communication from the constraints 
of geography’ must be contextualised.? A bureaucratic discourse on regulation 
was much more prominent than a utopian on liberation, many of the articles 
described the material components of the new network instead of immaterial 
flows of electrical signals, and the geographical prerequisites (such as ocean 
floors and mountains) that determined where cables could be laid out were 
described in detail. I recognised much of this already when I read the texts in 
order to delete the noise, but I believe the quantitative analysis made the con- 
clusions more convincing. 

Scholars writing about computational text analysis usually emphasise the 
need to combine distant and close reading.” You need to switch between 
different perspectives to get an understanding of general patterns, as well as 
individual cases. In my own research, I already had to read the texts more or 
less closely in order to clean and prepare the corpus. When I reviewed the lists 
and graphs of frequently co-occurring words, I had an in-depth knowledge 
about the dataset on which they were based, making it easier to interpret the 
output. The time I spent reading and editing turned out to be well invested, 
but the process was very different from what I had imagined when I started 
the project. 


120 Digital Histories 


Recognising Named Entities 


What media technology we consider to be the first one ever invented depends 
on our definition of media. One common definition emphasises that a medium 
is a technology for the storage and/or transfer of information.” In that case, the 
tally stick might be the oldest media technology in human history. A tally stick 
keeps track of things you want to count (days, people, objects, etc.) and makes 
it possible to save the counts for later and to transport them from one place to 
another. The oldest one found, a bone from a baboon with carved markings, is 
at least 40,000 years old. 'Although our ancestors could not have known it, their 
invention of the notched stick has turned out to be amongst the most perma- 
nent of human discoveries?” That my colleague Erik is using their invention to 
keep track of an imprecise digital tool in 2018 would definitely be beyond their 
imagination. Erik counts on paper though, not a bone from a baboon. 

Tools for NER make it possible to identify and extract names of persons 
(even mythological creatures), organisations and places in digitised texts, as 
well as expressions of time (1857, ‘next week’), monetary values and so on. 
The extracted data can be used for geographical visualisations, for network 
analysis, in timelines and as building blocks in other kinds of text analysis. 
HFST-SweNER, a language-processing technology developed to extract named 
entities from Swedish texts, is based on a dictionary as well as rules for identify- 
ing entities not in the dictionary, but likely candidates based on their contexts. 
Tests have shown that it works fairly well for a curated corpus of texts from the 
1990s, but will it work for 19th-century newspaper texts? 

Erik Edoff is a media historian interested in geography. In one of his pro- 
jects, he tries to figure out how new communication technologies in the 19th 
century reorganised the notion of space?! Was the world getting smaller when 
telegraphs, railroads, canals and steamships made it possible to communicate 
across space in a shorter time or in no time at all? Did far-away places come 
closer as a result of a time-space compression? One way to examine this (but 
certainly not the only one) is to identify and count place names in newspapers 
before and after the introduction of the new technologies (Erik selected papers 
from 1850 and 1890). Were names of distant locations printed more frequently 
when news travelled faster? The first results generated with NER indicated that 
places in the local region were in fact getting relatively more attention when 
new connections made communication faster, compared to places outside of 
the region. These were exciting results, since they seemed to show that the 
impact of the new technologies was different from what is usually believed. 
The question then became whether these numbers could be trusted. Did the 
tool find all the place names printed in the papers? If not, was it biased towards 
local Swedish place names? 

In order to calculate the precision and recall, Erik chose a few newspaper 
issues for every title and year in the corpus. He read through the NER-tagged 
text files manually, and kept track of valid hits and false negatives in two 
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Figure 7.1: How many named entities (locations) did NER find, and how many 
more did Erik find? New locations not tagged by the tool were recorded on 
post-it notes. Source: Author. 


columns on a couple of paper sheets. The method of counting was basically the 
same as the one used by our distant ancestors making notches on a bone: one 
mark for every word counted (see Figure 7.1). The brackets enclosing some of 
the counts separate place names mentioned in advertisements and lists, such 
as weather reports, stock market prices, etc. Those entities were more difficult 
to identify for the digital tool. There are other and perhaps more sophisticated 
ways to count occurrences of place names. But pen and paper are often efficient 
tools for minor tasks. No downloading or installing is required, and no special 
training. The interface makes the paper easy to use, and it is highly flexible. 
The manual control revealed that the tool had left several place names 
untagged. For some reason, it did not recognise locations such as Paris, Kiel or 
Swinemünde, nor the Swedish towns Gavle (in the 19th century: Gefle), Váxjó 
(Wexió) and many minor towns and villages. One explanation might be the 
old spelling, but in some cases (when the spelling changed between 1850 and 
1890), the tool recognised the old spelling, but missed the new. And the spell- 
ing does not explain the case of Paris. One geo-administrative category was left 
untagged almost completely: the parish. Today, it is hardly used outside of the 
Swedish church, but in the 19th century it was one of the most common ways 
in which Swedish locations were identified. Apart from these place names, Erik 
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found several locations untagged because of OCR errors. All of the entities 
identified manually were fed back into the system in order to make the final hit 
list more complete. 

It turned out that the trend indicated by the first results was even more prom- 
inent once the false negatives were included. The relative frequency of places 
in far-away countries did not increase with the introduction of new commu- 
nication technologies. Rather, locations close to the towns where the newspa- 
pers were published got more attention in 1890 compared to 1850. Erik's close 
reading of the sample issues provided him with some possible explanations. 
New places were put on the map thanks to new communication technologies: 
railway intersections, telegraph stations, bridges where steamships picked up 
passengers and goods, locks connecting canals and lakes. The places most fre- 
quently mentioned were those in the region, such as neighbouring towns and 
villages connected by railway, harbours close to home and regional centres 
nearby where telegrams were sent. New communications brought neighbours 
together. What was already close came even closer, while distant places were 
as far away as they were before. The repetitive task of recording place names 
on paper paid off in an interesting and convincing analysis. NER was a helpful 
tool, but it needed human assistance. 


Troubleshooting Black Boxes 


Digital models and tools will continue to improve. In the future there will, 
hopefully, be no need to carry out many of the manual tasks described in this 
text. OCR is getting more accurate every year; for some languages, NER seems 
to work fine already. On the other hand, as digital research practices are becom- 
ing more widespread, researchers will try to use the methods for new kinds of 
materials and in new areas—even areas where they will not run as smoothly. 
If we limited our research to clean datasets, very little would be accomplished. 
Many of the manual tasks carried out by research assistants and undergradu- 
ates in the 1960s are automatised today. New tools can achieve things unthink- 
able 50 years ago, but not always without human interventions. New problems 
seem to arise as old ones are taken care of. 

The long history of information management can be seen as a series of new 
solutions generating old problems. In a fascinating article about the paper tech- 
nologies used by Carl Linneus, in his big data-project on the natural system, 
Staffan Müller-Wille and Isabelle Charmantier note a ‘curious dynamic’ in the 
attempts to master information overload. “The many technologies that were 
designed to contain information actually fuelled its further production, partly 
by providing platforms for more efficient data accumulation, partly by bringing 
to the fore new structural relations and patterns within the material collected??? 
The result of technologies, developed to create order, overview and searchabil- 
ity, is often a new information overload. The digital media of today have other 
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capabilities than Linneus' paper slips and lists, but their operations are not as 
precise and clean as we might think. Rotten data, spam and noise thrives in 
a digital habitat (an interesting research topic in itself).? As shown by libraries’ 
digitisation efforts, new technologies are far from perfect and human assistance 
is sometimes needed to keep them on track. 

To edit, clean and validate large datasets manually or semi-manually may 
seem highly ineffective. In many cases, however, these procedures can be quite 
effective. Reading, counting, deleting and merging texts and other kinds of data 
in a manual or semi-manual fashion is a way to bridge distant and close read- 
ing. Insights from such encounters with data can be fruitful in the final analysis. 
It might also be a way to dig deeper into the inner workings of the digital tool 
on which the researcher is relying, to figure out how a specific dataset was pro- 
cessed and why the output turned out as it did. Troubleshooting is a good way 
to start if we want to examine what is inside the black boxes. 


Notes 


! The research presented here is part of the project "Digital Models: Techno- 
historical collections, digital humanities & narratives of industrialisation; 
funded by the Royal Swedish Academy of Letters, History and Antiquities. 

? Star 2002: 109. 

> On the role of marginal (and yet central) figures, actions and technologies 
in the history of science, see Becker & Clark 2001 and Krajewski 2018. 

* Thylstrup 2018: 42-43. See also Price & Thurschwell 2005. 

* Fyfe 2016. 

5 Dasu & Johnson 2003: ix. 

7 Parikka 2012: 111. 

* Wickham 2014: 2. 

? See, e.g., Jockers 2013; Graham, Milligan & Weingart 2016; Rockwell & 
Sinclair 2016. 

© Rieder & Róhle 2012. 

1 Krajewski 2018. 

12 Bush 1945: 104. 

? Robert Busa quoted in Burton 1981: 1. 

14 Krajewski 2018: 308. 

1 Burton 1981: 3. 

© Andre 1966: 96. 

7 Jarlbrink 2015. 

18 Hadenius 1968: 68. 

1 The coding manual is now available online. See Amark 2013. 

2 Rikardsson 1978: 59-60. 

?! Jones 2018. 

” Lagoze 2014. 
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% Jarlbrink 2018. 

% The newspaper noise is further explored in Jarlbrink & Snickars 2017. 
25 Fyfe 2016: 565. 

°° Carey 2008: 157. 

% Jockers 2013: 26; Blevins 2014: 126; Hitchcock & Turkel 2016: 953. 
28 Mitchell 2017. 

? Tfrah 2000: 64. 

3 Kokkinakis et al. 2014. 

%1 Edoff, forthcoming. 

? Müller-Wille & Charmantier 2012: 4. 

33 See Parikka & Sampson 2009; Eriksson 2016. 
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PART III 


Distant Reading, Public 
Discussions and Movements 
in the Past 


CHAPTER 8 


The Resettlement and Subsequent 
Assimilation of Evacuees from 
Finnish Karelia during and after the 
Second World War 


Mirkka Danielsbacka, Lauri Aho, Robert Lynch, 
Jenni Pettay, Virpi Lummaa and John Loehr 


Introduction 


The consequences of forced migrations are felt globally and are faced by millions 
of people each year. A critical question is how these refugees adjust to their 
new environments and eventually integrate into the host population. A num- 
ber of factors can influence the ultimate assimilation of migrant populations 
and these are frequently related to the characteristics of the migrants (for 
example, demographic variables and socio-economic background), flight 
(for example, cause of flight), host country or region (for example, natural 
resources) and the resettlement policies of host populations. One way to meas- 
ure the successful settlement and assimilation of displaced populations is to 
look at the number of times an individual relocates after their initial arrival 
in a host country and to analyse which factors affect these moves. In gen- 
eral, the more individuals move, the less likely they are to integrate.' In this 


How to cite this book chapter: 

Danielsbacka, M., Aho, L., Lynch, R., Pettay, J., Lummaa, V., & Loehr, J. (2020). The 
resettlement and subsequent assimilation of evacuees from Finnish Karelia during 
and after the Second World War. In M. Fridlund, M. Oiva, & P. Paju (Eds.), Digital 
histories: Emergent approaches within the new digital history (pp. 129-147). Helsinki: 
Helsinki University Press. https://doi.org/10.33134/HUP-5-8 


130 Digital Histories 


chapter, we investigate the evacuation of Karelia—a unique forced migration 
event that took place in Finland during the Second World War—and the socio- 
demographic and environmental factors associated with the relocation and set- 
tlement of Karelian evacuees during and after the war. 

Our methodological approach builds on the socio-economic tradition of 
conducting digital history in which we use quantitative data and methods to 
analyse and interpret a question of historical importance (the assimilation of 
Karelian evacuees). Therefore, it might be more accurate to define the meth- 
odological approach used here as quantitative history, rather than ‘new digital 
history, even though it can certainly be argued that all quantitative history is 
essentially digital history. In line with many of the new digital history projects, 
our data has been digitised and compiled with the help of new digital tools 
(see ‘Material and Methods, below) and we have an interdisciplinary research 
team of biologists, computer scientists, sociologists and historians with consid- 
erable experience working on these and similar datasets. However, analysing 
these newly extracted data has been executed with rather common statistical 
methods (for example, regression models) that are in line with the quantitative 
tradition of social and economic history. 

During the Second Word War, an estimated 40 million Europeans fled their 
homes in what is widely considered to be the worst refugee crisis in modern 
history.’ Finland faced this problem after it ceded Karelia to the Soviet Union in 
the aftermath of the Winter War (1939-1940) and once again after the Contin- 
uation War (1941-1944). Almost all Karelians were evacuated to the remaining 
parts of Finland. It has been said that Karelians have had the 'sad privilege of 
being the only refugee group in the world to have been displaced three times 
within a short period of four years— 1940-1944’? Two of these displacements 
were forced and resulted from the Soviet occupation of Karelia in the Winter War 
and at the end of the Continuation War, but one, during the Continuation 
War, was a voluntary migration back to recaptured Karelia. 

Previous historical and cultural studies of Karelians have concentrated on 
describing the Karelian evacuees and their assimilation in Finnish society, 
Karelians memories of the evacuations and the land they lost in Karelia, and 
the resettlement policy of the Finnish government.* In addition, previous soci- 
ological and epidemiological studies of Karelian evacuees have mainly focused 
on the long-term effects of forced migrations on mortality,’ income? or socio- 
economic status? by comparing displaced Karelians with the rest of the Finnish 
population. These studies have frequently been conducted with the same 1096 
sample data (n = 411,629) from the 1950 population census, which was the first 
full census implemented in Finland. Karelians can be extracted from these data 
because there is information on the place of residence from the year 1939 which 
is prior to the initial evacuation. A constraint of this dataset, however, is that 
itis limited to variables available in the years 1939, 1950 and in follow-up data- 
sets from 1970s onwards. Studies conducted with these data have found that 
after the Second World War Karelian men had higher socio-economic standing 
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and higher income than their non-displaced counterparts.” Sarvimäki and col- 
leagues suggest that one reason for this is that Karelians were more likely to 
move from their initial placement areas to other regions in Finland in search 
of better employment opportunities than non-displaced Finns. This suggests 
at least one explanation for the finding that younger male Karelians reached 
a higher socio-economic position when compared to non-displaced Finnish 
males. In addition, displaced people transitioned faster from agrarian to mod- 
ern occupations than non-displaced Finns, which could have also affected their 
improved socio-economic standing.'' On the other hand, a study by Haukka 
and colleagues found that displaced Karelians had higher overall mortality and 
ischemic heart disease mortality than the rest of the Finnish population. It is 
interesting to note, however, that when compared to international research on 
the long-term effects of forced migration, Karelians had lower suicide rates." 

Previous studies have not used micro-level data to explore whether Karelian 
evacuees differed in their migration histories as a function of their background 
characteristics. Background characteristics of Karelians as well as environmen- 
tal factors of a host country may associate with likelihood of migration; fur- 
thermore, migration history could reflect the assimilation of Karelians. More 
specifically we explore: Who moved back to Karelia when they had the oppor- 
tunity? And who remained in western Finland? Which environmental factors 
affected the likelihood of return? How much, on average, did Karelians move 
after the second evacuation (that is, how easily did they settle after the Second 
World War)? What were the characteristics of the evacuees who moved most 
frequently and what factors predict faster assimilation? 


Karelian Evacuees 


Two separate wars were fought with the Soviet Union on the eastern border 
of Finland. The Winter War started on 30 November 1939, when the Soviet 
Union attacked Finland, and lasted until 13 March 1940. During this first war, 
Finland lost 11% of its land territory, including its second biggest city, Vyborg. 
The Soviet occupation of Karelia forced approximately 407,000 people to flee 
their homes and to be placed elsewhere in western Finland. 

Before 1950, Finland was predominantly an agrarian country, and agricul- 
tural occupations were even more common in Karelia than in other parts of 
the country. Approximately 230,000 evacuees (57%) earned their living from 
agriculture. Not all of them were farmers, however, and some were agricultural 
workers who did not own the farm they worked on. These farms were, on aver- 
age, smaller than farms in other parts of Finland." 

The initial placement of the evacuees was poorly planned and organised due 
to the sudden start to the Winter War and rapid advance of the Soviet troops. 
Migrants were initially housed in public buildings that were used as shelters 
and only later were transferred to private residences. In the summer of 1940, an 
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Emergency Settlement Act and compensation law were passed. With the settle- 
ment law, farmers could obtain new land to farm and, with the compensation 
law, the state would pay for the lost property. 

Land for evacuees was acquired from the state, the Church and municipali- 
ties, but it was also frequently seized from private owners. Although Finnish 
authorities attempted to carry out land acquisitions with voluntary purchases, 
many farmers were forced to sell their land. The purpose of the Emergency 
Settlement Act was not to ensure that Karelian farmers would be fully compen- 
sated for the land they had lost, but rather was to make sure that those Kareli- 
ans who made their living from agriculture could continue to do so." 

Between the Winter and Continuation Wars, evacuees who made their liv- 
ing from agriculture, especially those who had their own farms in Karelia, had 
the hardest time adjusting because the Emergency Settlement Act forced them 
to wait before they received land. This may have caused additional friction 
between evacuees and the host population because of the hard labour short- 
age with which evacuees were expected to help.'* But it was the placement of 
Karelian evacuees among Swedish-speaking Finns that aroused the most criti- 
cism. This was because placing Finnish-speaking Karelians in bilingual munici- 
palities could have endangered the delicate relationship between Swedish- and 
Finnish-speaking populations. The language question came to the fore once 
again after the Continuation War, when Karelians had to be settled perma- 
nently in the remaining parts of Finland." 

Because carrying out the Emergency Settlement Act was slow, only about 
13,000 new small farms were actually founded and only 6,000 of these contracts 
were finalised by the summer of 1941, even though many more applications were 
received. With the onset ofthe Continuation War in the summer of 1941, Finn- 
ish troops reconquered the Russian occupied regions, which gave Karelians the 
opportunity to return to Karelia. Evacuees who had received emergency settle- 
ment farms were then allowed to cancel their contracts; more than half of them 
did so by March 1943 and returned to Karelia. Nevertheless, a few hundred 
households kept their emergency settlement farms and gave up their claims on 
their land in Karelia.’* Approximately 70% of the original evacuees (280,000) 
who had initially settled elsewhere in Finland voluntarily moved back to their 
previous home in Karelia, while the remaining 3096 decided to remain in their 
new location. The number of evacuees who returned was higher in some loca- 
tions of origin (for example, over 8096 for Sortavala) and lower in others (for 
example, 40% to 60% for Viipuri). Farmers were more likely to return (~75%), 
and although returning to locations near the front line was not allowed, some 
disobeyed and returned anyway. A long period of trench warfare kept the front 
line quite stable from January 1942 until summer 1944 when the final Soviet 
offensive began.” 

The Continuation War ended in the autumn of 1944 and the border was 
redrawn back to where it had been in 1940 and everyone who had returned to 
Karelia between the wars was evacuated once again." This time, the evacuation 
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and placement of evacuees were much more systematic than they had been 
after the Winter War. In May 1945, the Parliament approved the Land Acgui- 
sition Act (Maanhankintalaki), which guided the settlement policy.” Accord- 
ing to the Act, groups that were entitled to receive land were evacuees who 
had made their living from agriculture, disabled soldiers, war widows, war 
orphans, soldiers who had served on the frontline or had a family and several 
other smaller groups. Evacuees submitted almost 48,500 applications for land; 
92% of these were accepted and evacuees were placed in certain initial place- 
ment areas. In the summer of 1945, they started to move to other places, partly 
because they were ordered to and partly due to their own initiative. The official 
placement plan only applied to the agricultural population, which meant that 
townspeople and industrial workers were free to choose where they wanted to 
settle. Resettlement of the agricultural population was based on the idea that 
people from the same villages would be able to stay in the same areas and 
that their placements would correspond to the climatic, economic and religious 
circumstances of the area from which they were evacuated. The official place- 
ment plan was only applied in its strictest form to farmers, and among them 
those who were entitled to farm. This constituted about 35% of all evacuees. 
Although they were in the minority, the final resettlement plan resulted in most 
of the farmers having to move again. As a result, in the years immediately fol- 
lowing the war, movements may have been more prevalent among farmers than 
other evacuees.” 


Material and Methods 


Here we use the recently digitised Migration Karelia (MiKARELIA) database, 
which contains over 160,000 adult Karelians and a wide range of data on births, 
marriages, occupations and movements of these forced migrants. The original 
source material for the database comes from a register compiled in the book 
series Siirtokarjalaisten tie (Anon. 1970; the title directly translates to: Karelian 
migrants’ road), which systematically recorded the experiences of evacuees. 

Interviews took place between 1968 and 1970 and were performed by approx- 
imately 300 trained interviewers. Each entry lists the full name (maiden name if 
applicable), profession, birth date, birth place and all movements (towns or cities 
of residence) from birth until the date of the interview, as well as their spouses 
names, professions, birth dates, birth places and years of marriage for those who 
married. Childrens names, birth years and birth places are also listed. These 
basic demographic data are presented in a standardised format for each entry. 
There is a variety of other data as well, including, for example, whether men had 
served in the army during the war and whether women had participated in the 
Lotta Svárd organisation (an all-female paramilitary organisation). 

The resulting registers contain a vast amount of data on the Karelian migrants, 
but in book format, they are poorly suited to quantitative analysis. Therefore, 
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a project was initiated to digitise these data, which ultimately resulted in the 
generation of the MiKARELIA database. Data entries were scanned at 300 dpi 
using a Canon c5250i copier and saved in pdf format. ABBYY Fine Reader 12 
(ABBYY production LLC 2013) was used to scan pdf documents for optical 
character recognition (OCR) and the output saved in html format. An open 
source software program? was written to convert Fine Reader produced html 
files to a simpler xml format containing the data entries. The program reads 
and extracts the source text to produce a JSON file containing all extracted 
data. These data can then be used to populate a structured database.” 

Obviously, the MiKARELIA database represents only those Karelians who 
were alive in 1968 to 1970 when the interviews were conducted. However, 
Loehr and colleagues have estimated that these data include records on approx- 
imately 7596 of the Karelian migrants who were alive at this time. Therefore, 
MiKARELIA can be considered to be a population-based database and not just 
a statistical sample of Karelians.* The MiKARELIA database is being further 
improved and replenished by combining it with other datasets (for example, 
the Karelia database ‘Karjala-tietokanta, which contains digital demographic 
information from about 70 parish registers of the ceded Karelia from the end of 
the 17th century until the start of Second World War). 

One key advantage of the MiKARELIA database, for example, as compared 
to the Statistics Finland 1096 sample data from the 1950 population census, is 
that while in the sample data individual level variables (for example, migration 
of Karelians) are only available for the year 1950 and from 1970 onwards,” 
in MiKARELIA there are individual level data on evacuees during the Second 
World War (for example, whether they served during the war and whether 
they returned back to Karelia during the Continuation War). Therefore, the 
MiKARELIA database offers excellent opportunities to explore with consider- 
able detail the migrations of Karelians during and after the Second World War 
in addition to a variety of socio-demographic and environmental factors that 
were associated with their decisions to migrate. 

To determine whether Karelian evacuees differed in their migration histories 
as a function of their background characteristics, the current study involved 
analysing the already existing MiKARELIA database and combining it with a 
database on the location of all the cities and towns involved in the evacuations, 
and their population sizes. Populations of towns located in Finland and Karelia 
were obtained from the Statistical Yearbooks of Finland 1939.” In addition, for 
each place, we obtained coordinates to locate them on the map and calculate 
the effect of several geographical dimensions on the probability of returning 
home during the war (1941-1944). To do this, information was gathered from 
multiple sources on the internet and maps. Most of the coordinates could be 
found directly from the history books of the Suomen sukututkimusseura (the 
Finnish Genealogy Society), while the rest were searched from Google Maps— 
a map utility served by Google Incorporation (Google Maps, Finnish Geneal- 
ogy Society). 
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Sample selection 


Although interviewees provided some information on other members of their 
family (for example, spouses and children), in our analyses we focused solely on 
individuals who were interviewed and on which we had the most complete and 
systematically recorded information. Thus, the statistical unit for this research 
is the family, rather than each family member separately, given that families 
were presumed to have moved together. In addition, children (individuals 
who were born after 1925) were excluded from these analyses. These individu- 
als would have been 15 years old or younger in 1941 when the first opportunity 
to return to Karelia was possible. The birth location, rather than the location 
in Karelia at the moment of the evacuation, was used because the location of 
the evacuees immediately prior to the evacuation was only recorded for a small 
subset of individuals, whereas birth place was available for more than 90% 
of the total sample. Finally, only those who were born in Karelia were chosen. 
These selection criteria left us with a sample of 59,477 Karelian evacuees. Each 
population size parameter (birth population, population of first destination 
in Finland and population of return destination in Karelia, which was used in 
the maps) was log transformed for reasons of statistical inference (that is, the 
effects of population size are not expected to be merely additive) and to aid fit- 
ting the models. 


Variables 


As a dependent variable, we use a binary variable: whether an individual 
returned to Karelia or not (0 = no, 1 = yes). In our sample, 52% returned, which 
is a somewhat lower number than the overall proportion of evacuees who 
returned, which was reported to be approximately 70%. This may be because 
we are both missing the oldest Karelians who might have been more likely to 
return than the younger ones (and had died by the time of the interviews in 
1968 to 1970), and also missing those who were children (less than 15 years 
old) during the war. Our other dependent variable is the total number of moves 
after 1945 and up until 1970. In our sample, Karelians had on average 1.02 (SE 
0.005) moves, which varied between 0 and 19. The majority, however, moved 
at least once (5496). 

As independent variables we use: sex, age, occupation in 1970 (farmer or 
non-farmer; we are expecting that farmer was the most 'stable' occupation, that 
is, one can assume that they were already farmers in Karelia), whether he or she 
had children in 1940, longitude and latitude of birth location, longitude and 
latitude of first destination in Finland and population size in birth location 
and first destination in Finland (see Table 8.1 for descriptive statistics). 

We used generalised linear logistic and Poisson regression models to ana- 
lyse these data. In the case of returning to Karelia during the Continuation 
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Table 8.1: Descriptive statistics for those who returned Karelia and those who 
did not (96/mean, (SE)). 


Returned Karelia 1941-1944 
Yes No 
Sex (96) 
Women 53.1 46.9 
Men 52.2 47.8 
Age (mean) 30.7 (0.06) 28.9 (0.06) 
Farmer (96) 
Yes 73.5 26.5 
No 42.9 57.1 
Have children (%) 
Yes 60.6 39.4 
No 48.5 51.5 
Birth destination longitude (mean) 29.8 (0.01) 29.7 (0.01) 
Birth destination latitude (mean) 61.0 (0.004) 60.9 (0.005) 
Destination longitude (mean) 24.8 (0.02) 25.3 (0.02) 
Destination latitude (mean) 61.9 (0.01) 61.6 (0.01) 
Destination population size (mean) 18280.1 (351.6) 43327.9 (743.5) 
Birth population size (mean) 12559.33 (116.9) 17915.5 (212.3) 


Note: Demographic variables n = 49,780; environmental variables n = 29,622. 
Source: Authors. 


War, we explain our binary dependent variable (returned to Karelia = 1, did 
not return = 0) using a logistic regression and the coefficients of the predic- 
tors are interpreted as odds ratios. An odds ratio above 1 indicates a greater 
likelihood of the event compared to the reference category, and an odds ratio 
below 1 indicated a smaller likelihood when all other covariates entered into 
the model are held constant. To model the number of moves after the Continu- 
ation War, we used both a logistic regression (no moves = 0, at least one move 
=1) and a Poisson regression which fits these count data (namely, the number 
of moves) better than a normal distribution. Poisson regression coefficients can 
be interpreted in a similar manner as linear regression coefficients such that 
negative coefficients indicate a negative relationship and positive coefficients 
indicate a positive association with the outcome variable when all other covari- 
ates are held constant. 
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Results 


The evacuees in the MiKARELIA database we used here were from areas west 
and north of Lake Ladoga, which are the regions Finland lost to the Soviet 
Union. The distribution of evacuees’ homes at the time of evacuation are illus- 
trated in Map 8.1. Geographically, the top destination as well as the rest of the 
distribution are generally widely spread across southern Finland (see Map 8.2). 
The distribution of returning evacuees to Karelia during the Continuation War 
is similar to the baseline in 1939, with a few exceptions in southern Karelia. The 
number of people who returned is, of course, fewer than the number who left 
from those same places (see Maps 8.1 and 8.3). 


Return to Karelia during the Continuation War 


Evacuees were spread across an area spanning 60 to 70 degrees latitude, with 
most people concentrated in the south, especially in areas below 64 degrees 
latitude (see Map 8.2). Map 8.2 shows the percentage who returned from each 
evacuation destination in western Finland. Here, it is evident that the degree 
to which the return rate depends on location is complex. However, return rates 
below 50% are more common at lower latitudes and a higher percentage of 
people returned to northern parts of Karelia. Evacuees were spread quite evenly 
across an area of Finland spanning 19 to 31 degrees longitude. At more western 
longitudes, the proportion of those who returned to Karelia is greater. 

Overall, Karelians placed in northern and western Finland were more likely 
to return. Evacuees were spread across towns and cities of varying populations, 
but those evacuated to the areas in the largest category (population size greater 
than 20,000), fewer than 60% returned. No other relationships between popu- 
lation size and return rate were obviously evident (Map 8.2). 

Results shown in Table 8.2 are from a two-stage stepwise logistic regression 
model in which the dependent variable is whether or not a person returned to 
Karelia, and the independent variables are added to the model in two stages: first, 
all socio-demographic variables and, second, all environmental variables. Results 
from Model 1 show that men were less likely to return than women. In subse- 
quent sensitivity analyses, this difference disappeared, however, once the fact that 
many men were serving in the army was taken into account (results of sensitivity 
analyses are not shown in Table 8.2). In addition, results suggest that age was 
not a significant predictor of returning to Karelia. Being a farmer was, however, 
and the predicted probabilities (calculated from odds ratios) of returning for 
farmers was 73% as compared to non-farmers, which was 43%. (Note: In Model 
2, which also takes into account environmental factors, the probabilities were 
76% and 53% for farmers and non-farmers respectively.) Therefore, the adjusted 
probabilities did not differ much from the unadjusted distribution (see Table 8.1). 
Having children was also positively associated with returning to Karelia. 
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Map 8.2: The proportion of evacuees who returned to Karelia from towns in 
western Finland during the Continuation War. Source: Authors. 
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Table 8.2: Association between socio-demographic and environmental factors 
with the likelihood of returning to Karelia during Continuation War. 


Model 1 Model 2 
9596 CI 9596 CI 
OR SE p lower upper OR SE p lower upper 
Sex 

Women (ref.) 

Men 0.83 0.02 0.000 0.80 0.86 0.84 0.02 0.000 0.80 0.88 
Age 1.00 0.001 0.091 1.00 1.00 101 0.002 0.000 1.01 1.01 
Farmer 

No (ref.) 

Yes 3.64 0.08 — 0.000 3.49 3.79 2.78 0.08 — 0.000 2.61 2.95 
Have children (96) 

No (ref.) 

Yes 1.38 0.03 0.000 1.32 1.44 1.19 0.04 0.000 1.12 1.27 
Birth destination longitude 0.78 0.02 0.000 0.75 0.81 
Birth destination latitude 2.31 0.08 0.000 2.15 2.48 
Destination in Finland longitude 0.89 0.001 0.000 0.88 0.90 
Destination in Finaland latitude 104 0.01 0.002 1.01 1.06 
Destination population size 0.74 0.01 0.000 0.73 0.76 
Birth population size 0.98 0.02 0.126 0.95 1.01 
n 49,780 29,622 
McFaddens Adj r2 0.067 0.102 


Note: Results from two-stage stepwise logistic regression. Source: Authors. 


Results of Model 2 (Table 8.2) take into account several environmental vari- 
ables, in addition to socio-demographic variables. Results from this regression 
model support the conclusions drawn from the maps shown above: people from 
more western and northern birthplaces were more likely to return and evacuees 
who went to more westerly and northerly destinations in Finland were more 
likely to return. In addition, the population size of the destination town or city 
in Finland was significantly and negatively associated with the likelihood of 
returning to Karelia during the Continuation War. In other words, people placed 
in less populated areas were less likely to remain and more likely to return to 
Karelia. Taking these environmental factors into account did not alter the effects 
of socio-demographic factors, although age was significantly and positively 
associated with returning, meaning that older people were more likely to return. 

Map 8.3 indicates that evacuees from the larger populations have relatively 
fewer people returning, although the association was not statistically signifi- 
cant in the regression model. In addition, places located nearer to the front line, 
especially in the Karelian Isthmus, had relatively fewer people returning. 


Migration after the Continuation War 


Our second dependent variable considered the number of moves after the Con- 
tinuation War. First, we investigated those Karelians who had moved at least 
once after their first placement (Table 8.3). Here, men were more likely than 
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Map 8.3: Proportion of evacuees who returned to their natal locations in 
Karelia. Source: Authors. 


women to move at least once. Also, the younger these individuals were, the 
more likely they were to move multiple times. Being a farmer was also posi- 
tively associated with the likelihood of moving at least once. In addition, those 
who had returned to Karelia during the Continuation War were more likely to 
move at least once after the war than those who did not return. 
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Table 8.3: Socio-demographic factors and whether or not individuals returned 
to Karelia associated with the likelihood of Karelians to move at least once 
after Continuation War 


95% CI 
OR SE p lower upper 

Sex 

Women (ref.) 

Men 1.26 0.02 0.000 1.22 1.31 
Age 0.99 0.001 0.000 0.98 0.99 
Farmer 

No (ref.) 

Yes 1.11 0.02 0.000 1.06 1.16 
Returned Karelia 

No (ref.) 

Yes 2:22 0.04 0.000 2.14 2.31 
n 49,241 
McFaddens Adj R2 0.034 


Note: Odds ratios from logistic regression model. Source: Authors. 


Second, we examined whether the same socio-demographic factors were 
associated with the frequency of moves among Karelians after the Continua- 
tion War (Table 8.4). As was the case with any moves, men, younger Karelians 
and those who returned to Karelia during the Continuation War were all more 
likely to move more after the Continuation War than women, older Karelians 
and those who did not return to Karelia when they had a chance. However, 
farmers were less likely than non-farmers to move more after the Continuation 
War. This was the only factor that was differently associated with moves when 
compared to the previous model (Model 1). 


Discussion and Conclusions 


Our primary aim in this chapter was to study how the migration histories of 
Karelian evacuees during and after the Second World War were influenced by a 
variety of social, environmental and demographic characteristics. Which evac- 
uees were more likely to move back to Karelia when they had the opportunity? 
Which environmental factors influenced an individual's decision to return or 
remain? How many times, on average, did the Karelians move after the second 
evacuation and who moved the most and who settled the fastest? 
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Table 8.4: Socio-demographic factors and whether or not one returned to Kare- 
lia after the Winter War is associated with the frequency of moves among 
Karelians after the Continuation War. 


95% CI 
coeff. SE p lower upper 

Sex 

Women (ref.) 

Men 0.10 0.01 0.000 0.09 0.12 
Age -0.02 0.0005 0.000 -0.02 -0.02 
Farmer 

No (ref.) 

Yes -0.16 0.01 0.000 -0.18 -0.14 
Returned Karelia 

No (ref.) 

Yes 0.40 0.01 0.000 0.38 0.41 
n 49,241 
McFadden’s Adj. R2 0.022 


Note: Coefficients from Poisson regression model. Source: Authors. 


We used the new MiKARELIA database which has unique individual level 
information on moves of Karelians during and after the Second World War. 
Our results are mainly in line with previous studies, although very few of 
these have concentrated on factors related to returning to Karelia during the 
war. We found that both socio-demographic and environmental factors were 
associated with returning to Karelia during the Continuation War. 

In detail, we found no sex differences in the likelihood of returning once we 
took into account the fact that many men were serving in the army during the 
Continuation War. Previous studies? have shown and this study confirms that 
farmers were more likely to return than non-farmers. In addition, once 
environmental factors are taken into account, the models show that older 
individuals were more likely to return than younger ones. This suggests that 
those Karelians who were in a more stable phase of life and who were probably 
more attached to their home districts (for example, had family and land and 
were older) were more likely to return to Karelia. Environmental factors also 
made a difference. People placed in more westerly and northerly destinations in 
Finland were more likely to return,” while at the same time Karelians who were 
from more western and northern birthplaces were also more likely to return. In 
addition, evacuees who were placed in smaller towns were also more likely to 
return. Importantly, these environmental factors, which had been documented 
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previously in other studies, were still significant predictors of returning to 
Karelia even when socio-demographic variables were controlled for. A major 
advantage of this study is that we can take several characteristics into account 
at once and draw the conclusion that, for instance, occupation (namely, being a 
farmer) did not alone explain the variation in returning to Karelia. 

While analysing the number of moves after the Continuation War, we discov- 
ered that nearly half of the Karelians actually settled permanently in their first 
location (46%). Those who had moved at least once were also more likely to be 
farmers. This was probably the result of the resettlement policies and the Land 
Acquisition Act?! which required farmers to wait to acquire their own 
land. However, after this initial displacement, the farmers were less likely than 
others to move, which suggests that they probably settled the fastest. Finally, 
younger people, men and those who did return to Karelia during the war were 
more likely to move at least once after the war ended and these evacuees also 
moved more overall. The positive association between evacuees who returned 
to Karelia and subsequent movements after the war ended is particularly inter- 
esting because it seems to contradict our findings on the characteristics of 
those who returned and those who moved more after the war. In other words, 
although farmers were more likely to return to Karelia when an opportunity 
came and individuals who returned were more likely to move after the war 
ended, those who moved more after the war were also less likely to be farmers. 
This indicates that there may be yet-to-be-determined factors influencing the 
relationship between returning to Karelia and geographic mobility after the war 
and suggests that these relationships need to be investigated further. 

The main strengths of this chapter are that we were able to utilise individual 
level data on a large number of Karelians to study their migration during and 
after the Second World War. A key advantage of this kind of database and the 
methods used in this chapter are that we were able to simultaneously take into 
account several factors that are associated with the frequency of migration. 

The main limitations of this study are data related. For example, we do not 
have data on the oldest Karelians and currently we only have occupations for 
people in these data from when they were interviewed in 1970. These issues 
are related to the fact that the original data were collected in 1968 to 1970. 
However, a crucial advantage of having digitised these data is that we can in 
the future continuously update and supplement these data with other source 
material and merge them with other large quantitative databases available for 
the Finnish population. 

Future studies could investigate more closely the migration profiles of dif- 
ferent sub-groups of Karelians. For instance, what happened to those farmers 
who did not return to Karelia during the Continuation War or to those evacu- 
ees who settled in their first location after the Continuation War? By examin- 
ing more closely the movements of different groups of Karelians, we may also 
explore how the early settlement of evacuees is linked to the long-term out- 
comes associated with forced migration.? 
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CHAPTER 9 


Towards Digital Histories of Women’s 
Suffrage Movements 


A Feminist Historian’s Journey to the World of 
Digital Humanities 


Heidi Kurvinen 


Introduction 


During the past decade, the amount of digitised material has exploded,’ but 
they are not available to all researchers in an equal manner. The entrance 
to the field of digital humanities requires cultural and technological capital 
which excludes or marginalises researchers who do not have the skills to con- 
duct digital analyses by themselves or do not have access to the organisational 
support. According to Matthew K. Gold, it is research-intensive universities 
containing sufficient financial and human resources that have been able to 
embrace the digital turn.? Again, this ability to focus on digital research and 
hire personnel to carry out the analysis has formed a ‘circle of good’ stabilis- 
ing their status within the field. Other universities, not to mention individual 
researchers, have been less fortunate, but simultaneously digital analysis tools 
increase the expectations that we as scholars are expected to accomplish.’ 
Gender is one of the factors that seems to affect the ability of researchers to 
take part in the digital turn. Farida Umrani and Rehana Ghadially discuss the 
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aspect of empowerment that is connected to the access of computers for women 
by using the division between ‘information rich and ‘information poor’ peo- 
ple.* Even though their approach is connected to third-world countries and 
the use of computers in general, I find the division useful also in the context of 
digital humanities. Similarly, researchers within humanities and social sciences 
are nowadays divided into those who have skills and access to digital research 
methods and those who don't. Based on stereotypical gender role expectations 
(such as viewing technology as masculine coded) with which most genera- 
tions of current historians have been raised, men have often better opportuni- 
ties to explore the field even if the starting point is the same with their female 
colleagues. The reason for this, as Miriam Posner has pointed out, is that mid- 
dle-class white men are more likely to have been encouraged to explore with 
computers at a young age than women or other marginalised groups.° At the 
same time, most present-day researchers are regardless of gender already to 
some extent participating in the digital turn as the mixing of traditional and 
new (digital) research practices have become a self-evident part of our present- 
day work as scholars, in the form of digital voice recorders, digital cameras and 
the use and analyses of digital texts and images.* 

In this chapter, I will discuss what is needed when a historical scholar with 
limited digital skills wants to take a step towards learning how to conduct digi- 
tal analyses, towards becoming a digital historian. As a feminist historian, I will 
combine this approach with a discussion of the relation of feminist research 
and digital humanities. In line with practice in feminist research, I will be using 
a self-reflexive approach and asking how the increase in the understanding of 
digital methods influences our research questions in feminist history. Do digi- 
tal humanities tools transform our work as feminist historians? How can digital 
analyses develop the field of gender history in general and the history of femi- 
nism in particular? Can a scholar who has limited technological skills engage 
with an informed and critical discussion with digitised materials? 

Even though the main points of my chapter apply to all historical research, 
a focus on gender analysis is worth making as gender seems to have remained a 
rather limited category of analysis among digital historians. And although not 
all gender historians identify their work as feminist, there is a strong connec- 
tion between the two,’ which makes the discussion of the relation between 
feminism and digital analysis a valid starting point for this chapter. The inten- 
tion is not, however, to claim that there is a clear difference between feminist 
history and historical research in general, but to participate in the discussion of 
the meanings of feminist approaches to digital humanities and ponder why in 
particular feminist historians should be part of these discussions. 


Feminism and Digital Humanities 


At first, the combination of feminist research and digital analysis may seem as 
strange bedfellows, but for the past decade feminist digital humanities research 
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has been conducted in various fields in the Anglo-American world, in particu- 
lar, and scholars have engaged in critical discussion of the relation between 
feminism and digital humanities. Some scholars have problematised this rela- 
tion by deconstructing the gender-neutrality of digital analysis tools in order to 
find ways to overcome the divide between male producers and female users of 
computational tools.’ Others have asked how the use of a large amount of data 
fits with feminist research that relies on gender-sensitive reading.? Scholars 
have also found similarities between feminist research and digital humanities 
approaches, such as collaborative research. According to Janine Solberg, femi- 
nist research and digital humanities may even form a fruitful pair since the idea 
of an ethical (feminist) researcher encourages the scholar to be open for multi- 
ple viewpoints and to position oneself as a researcher and conduct the research 
in a transparent manner (features that are also valued in digital humanities)."° 

Discussion on a feminist approach to digital humanities has mainly focused 
on pondering how gender, race and other marginalising factors can be taken into 
account when compiling datasets and digital archives. In addition, the respon- 
sibility of feminist scholars to unsettle digital humanities’ ‘retro-humanist’ 
practices that maintain a canonical understanding of what is relevant to digitise 
have been pointed out."! However, less has been written about the actual method- 
ological practice of conducting a feminist digital humanities project. Neverthe- 
less, there are some exceptions. In her insightful article on the US suffragette 
Frances Maule, Solberg, for example, points out how the new technology made 
it possible for her to find information about Maule, whose life was relatively 
unknown when she discovered her. At the same time, new information that 
she was able to find thanks to digitised material changed the interpretations of 
Maules texts used by Solberg in her work.” Thus, digitised data can help us to 
find sporadic information of our research subject and combine these pieces of 
information more easily than previously. Mass digitisation may also widen our 
opportunities to find traces of people who have been marginalised in the past 
or purely forgotten, which is consistent with the core ideas of feminist research. 

In spite of the existing literature on feminism and digital humanities, feminist 
digital history seems to be an under-discussed area of study. For instance, 
scholars of feminist historiography of rhetoric, Jessica Enoch and Jean Bessette, 
argue that feminists have used digitally born materials to study womens lives, 
but historians have rarely pondered how digital methods could widen their 
scope of study.? A few exceptions have appeared in feminist literary history, 
in particular, but the field is still narrow. This seems problematic because the 
digital turn has already started a revolution in history which will potentially 
profoundly change our scholarship and require us to learn new tools, as Alexis 
Lothian and Amanda Phillips have formulated." Also due to this, gender- 
sensitive historians should start to pay attention to the challenges and opportu- 
nities that the digital turn will cause in our field. 

The argument follows Solberg as well as Enoch and Bessette, who have 
demanded more discipline-specific discussions on the role of digital analysis 
tools and digital research materials.^ According to my understanding, a focus 
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on gender history is particularly important because, as an already marginalised 
sub-field ofa traditionally masculine discipline, it may not otherwise be able to 
answer the demands of financers that have started to highlight the importance 
of digital methods. In addition, gender sensitivity is needed to guarantee that 
digitisation of archival sources and other material such as print media or books 
do not focus solely on canonical pieces or notable people of history. Even though 
gender has been written into history increasingly since the 1960s, digitised 
collections often maintain gender bias when mens history tends to be viewed 
as more important. Furthermore, a move towards more digitally aware research 
may require that gender historians, alongside gender scholars in other fields, 
start to discuss how women’s and other minority groups’ abilities to conduct 
digital humanities research could be supported. However, the intention of this 
chapter is not to strengthen the essentialist notion of women as less capable 
than men of conducting digital analysis, but by using my own research field as 
a starting point to instead problematise what is needed to support scholars such 
as myself who do not have the basic digital skills, making it difficult to start on 
their own. 


Taking the First Steps in the World of Digital Humanities 


In Finland, the first computer started to operate in 1958 and a relatively rapid 
computerisation has taken place in the country since the 1960s. This has also 
had its effect on research. Historian Viljo Rasila was already writing about 
computer-assisted research in 1967 and used these kinds of methods in his 
work. However, when I began my studies at university as a fresh under- 
graduate student in 1999, I did not own a computer, and neither did many 
other students at this time. Computer-assisted methods did not belong to the 
curriculum and in spite of the accelerating computerisation of the Western 
world, for many they remained primarily a tool for writing and for using pub- 
lishing or photo programs. This applied also to my relationship with comput- 
ers, which explains why I never learned to understand properly how computers 
work as operational systems. For me, computers remained tools that I used to 
write and I knew only as much of them as was needed to complete that task. 

As a scholar who began her postgraduate studies in the mid-2000s, I was even 
able to conduct my PhD studies without ever hearing the words digital humani- 
ties. I first became familiar with the field as late as in 2015 when editing an arti- 
cle on that matter for a Finnish peer-reviewed journal as part of my duties as a 
sub-editor. I became instantly intrigued; but for a person with limited IT skills; 
it felt overwhelming to even try to figure out how I could use the approach in 
research. As for many, I assume, the first push towards this took place after a 
year while I was writing a research proposal for a major financing body and 
tried to figure out how I could elevate the state of art so that my project would 
be successful in receiving funding. I started to read digital humanities literature 
and tried to understand what all this could mean for my project. 
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By reading the texts, I started to realise that digital humanities projects were 
often collaborative initiatives (that is, not everyone needs to know how to 
code). However, at that time, I was working as a visiting scholar at a Swedish 
university with a Finnish research grant, which meant I found it difficult to 
start looking for collaborative partners. How could I even start to look for 
them, I was asking myself. In short, I lacked the institutional support as well 
as technical skills that would have helped me to explore the field on my own. 

For a year, every now and then, I read articles on digital humanities, some of 
which made more sense to me than others. I was, for instance, exhilarated to 
find out that distant reading could be combined with close reading, the latter 
of which is the method I am most familiar with. However, the basic question 
remained the same: How could I start to understand the process of the analysis? 
Simultaneously, I changed universities and my new colleagues helped me to find 
pages that offered guidance for different analysis tools, such as The Program- 
ming Historian, but it still felt difficult to start learning on my own. Luckily, in 
the spring of 2018, I was selected for the practical course on digital humanities 
organised by the project ‘From Roadmap to a Road Show led by Mats Fridlund. 
It turned out to be the first step towards understanding what digital humanities 
could actually mean for my historical research. As my project, I selected a small- 
scale case study of the use of the word ‘naisasialiike’ (women’s suffrage move- 
ment) in the biggest national newspaper Päivälehti (1889-1904) / Helsingin 
Sanomat (1904—) at the turn of the 19th and 20th centuries. 


Learning by One's Mistakes 


Janine Solberg has argued that digital environments can be used as safe spaces 
to test our research ideas." In her work, Solberg did not rely on big data, but 
used digitised material to trace pieces of information of her research subject. 
However, her argument also seems suitable for a scholar who combines a big 
data approach with close reading of a relatively small pool of data, as is the case 
in this chapter. 

Previously, I had used the National Library of Finland's digital newspaper 
archive from time to time to look for information. However, I had only used 
the search option, without trying to familiarise myself with the platform. Due 
to this, I had the habit of writing down the texts that interested me and it was 
only when preparing the data for the course that I realised that an OCR view of 
the text would make it easier to gather the data. However, from the literature, I 
had learned that not all letters would necessarily appear the same in the OCR 
text as they were in the original.'* This was painfully clear concerning the mate- 
rial from Päivälehti, in which the articles were published by using fraktur, an 
old German font type. For instance, the machine was unable to recognise the 
letter ‘w as such, but it was often written with ‘m or with a combination of 
two letters such as ‘r and ‘i. Also similar kinds of errors took place with other 
letters such as ‘s, which had different kinds of typographical variations in the 
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original text. Furthermore, wider spacing of words which was used to high- 
light certain words in the newspaper texts, such as surnames, caused trouble 
for the machine. More importantly, the machine was unable to recognise the 
newspaper columns which ran in an uneven manner at that time. This meant 
that the OCR view did not include texts in the order they were mentioned, and 
I had to copy the paragraphs manually. This included removing paragraphs 
that were not part of the article I was interested in and ascertaining that all the 
paragraphs of my article had been copied to the file. Due to these problems, 
the OCR view made it only slightly easier to prepare the material than my origi- 
nal manner of writing everything manually. While the newest articles might 
had had only a couple of errors, older texts were often impossible to under- 
stand based on the OCR. This meant that I had already read through some of 
the texts while correcting the OCR, making me familiar with the material. The 
same kinds of problems have also been noticed by other scholars who have 
problematised the idea of digital analysis as a rapid way to conduct research 
with newspaper material. 

My first experience with any kind of a data analysis software took place in 
May 2018, when I participated in the earlier-mentioned digital history course. 
Based on my project abstract, in which I had suggested that I would use a sta- 
tistical natural language processing tool called MALLET for carrying out topic 
modelling analysis of the data, I had been assigned to a MALLET group led by 
the digital historian Mila Oiva. In addition, Juho Savela was providing techni- 
cal support. My first challenge on the course was to learn to understand what 
can be made when using the command prompt of my computer. After that, I 
learned some basic commands for MALLET which helped me to start playing 
around with the material. Thus, the course gave me a basic understanding of 
how the command prompt functions worked and what I as a researcher could 
do with the material by using MALLET. However, the process of writing this 
chapter has been a test in which I have used MALLET and my own computer 
as a safe space for learning more by using the ‘learn by your mistakes method. 
Gradually, this has deepened my understanding of the process, even though 
there are still many things I do not understand. 

One of the major revelations during the process has been that combining 
digitised material with technologically assisted analysis needs suitable research 
questions. According to Solberg, digital tools change our ways of discover- 
ing, accessing and making sense of the past. To be more specific, digital envi- 
ronments can reorientate us both physically and conceptually’ if we choose 
to be active technology users instead of remaining as passive users of them.” 
Similarly, Jacqueline Wernimont has defined the division of male creators and 
female users of digital tools as one of the critical questions that feminist digi- 
tal humanities needs to address.?' Furthermore, other feminist scholars have 
engaged in critical discussion of what is enough to make the field more diverse 
and whether the ability to code is a necessity for all digital humanities schol- 
ars.” In my case, the move towards a more active user of digital tools meant 
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that my original research questions changed during the process. I realised that 
MALLET would not be the best option to trace the transnational influences of 
feminist ideas as I had originally thought. Instead, it offered a window towards 
the variety of topics connected to which the word naisasialiike was used. In the 
following, I will outline the process of carrying out the analysis as well as pon- 
der whether digital analysis of a relatively small pool of newspaper articles can 
bring new information concerning the early feminist movement in Finland. 


The Importance of Search Words 


In the project plan, I outlined the research period to cover the years between 
1889 and 1929. I used Päivälehti's first publishing year as a starting point for 
the search period because the first women’s organisation Finsk Kvinnoforening 
—Suomen Naisyhdistys had been established five years earlier in 1884.” The 
period ends in 1929, when the New Marriage Law was approved in Finland, 
forming one kind of an end point for the early feminist movement.” When 
outlining the period like this, I assumed that it would consist of a reasonable 
amount of data that I could use in my analysis. Surprisingly, the number of texts 
using the word naisasialiike proved to be relatively low. The search brought 51 
results, one of which was a list of literature for Christmas presents. Because 
the list contained only one item that was relevant for the theme, I did not take 
it into account. Other hits were mentions as part of larger articles dealing 
with various women’s organisations and their gatherings or books that dealt with 
womens issues. Additionally, the pool of data consisted of few notifications for 
meetings organised by the Womens Feminist Union (Naisasialiitto Unioni), 
among others. 

The search also brought to light other astonishing revelations. At first, the 
word naisasialiike seems not to have belonged to the newspapers vocabulary 
at all, since the first hit was from the year 1896 (12 years after the first women's 
organisation had been established). Before 1900, the word naisasialiike had 
been used only five times and continued to be used quite rarely until the 1920s: 
between 1900 and 1920, it appeared 12 times. Thus, it seems that the word made 
its breakthrough in the 1920s, even though it was still used only occasionally. 
This is slightly surprising because the 1920s was a relatively quiet period in the 
Finnish women's movement compared to earlier decades. 

One explanation for the concentration of the use of naisasialiike may 
be that, during the 1920s, it was used as a retrospective term to look back 
on the history of the women's movement. However, throughout the studied 
period, it appears also as an umbrella term that was used to refer to women's 
emancipatory demands of its own time. In other words, naisasialiike seems to 
have become a label that was used both by the women’s movement activists 
and their opponents, and it was accepted by the newspapers editorial office. 
Thus, the question remains: Why did the most active years of early feminism as 
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a movement in Finland not cause wider coverage in the pages of Püiválehti / 
Helsingin Sanomat? 'The question is particularly interesting because the word 
naisasialiike is the one that is commonly used in the research to refer to the 
early feminist movement in Finland. Due to this, it could have been assumed 
that the word had appeared in public discussion at the turn of the century. 

When looking for answers to the above-mentioned question, it is worth 
taking into account that the results would have been different if I had used 
different search words. For instance, Hieke Huistra and Bram Mellink have 
reminded us that a digital humanities scholar needs to choose the right search 
words to receive reliable results.” However, the problem is that a topic can be 
described with a variety of words that appear at different times, the meaning 
of which may change in different contexts and throughout the studied period. 
In Päivälehti / Helsingin Sanomat, the words feminismi (feminism) and 'femi- 
nisti (a feminist) received 11 hits respectively between 1889 and 1929, but the 
usage of them took place mostly before the 1920s. The word ‘naisliike (womens 
movement), which refers to all kinds of women’s organising (both feminist and 
non-feminist), received 255 hits, which suggests that issues relating to women’s 
status in society were not yet directly connected to feminist organising at the 
turn of the 19th and 20th centuries. The most comprehensive word was 'nais- 
asia’, which received 897 hits. It was used for the first time in 1890, after which it 
appeared continuously throughout the research period, suggesting that wom- 
ens issue as a theme was part of the public discussion of its time, but it was 
not connected to a specific movement per se. An analysis of the usage of the 
word naisasia would, therefore, give us a more comprehensive understanding 
of the early feminism in Finland, but the cleaning of the material would also 
require a considerable amount of work, which was not possible in the scope of 
this chapter. Furthermore, such big data would have made it difficult to use this 
process as an opportunity to reflect on the relation between feminist research 
and digital humanities. Comparison of the usage of various words dealing 
with womens issues nevertheless reveals the development of terminologies 
which has been shown to be one of the strongest sides of big data analysis. 
However, as Alex Mold and Virginia Berridge remind us, these kinds of results 
also need to be contextualised and triangulated with other sources/traditional 
research methods in order to receive a more accurate understanding of the 
results provided by the digital analysis.” 

As pointed out earlier, a close reading of research material is one of the corner 
stones of feminist research, and digitised computer reading of big data seems 
to be in contradiction to this. One solution to overcome this conflict is to com- 
bine computer-assisted analysis with close reading of the material or parts of 
it, as Johan Jarlbrink, Pelle Snickars and Christian Colliander have suggested, 
among others." Based on my small-scale project, a combination of distant and 
close reading is not only necessary to validate the results, but the combination 
also gives new perspectives on the material —one example of which is the con- 
nection between national and transnational discussion in Finnish feminism. 
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Previous scholars who have worked with archival material or used media texts 
in a more traditional manner have extensively shown how the so-called first- 
wave feminism was committed to national issues in Finland even though the 
early feminists, at the same time, had wide transnational networks. For instance, 
Alexandra Gripenberg saw women’s emancipation as necessary for human 
progress and therefore it had to be strived for universally. Simultaneously, her 
work towards women’s emancipation was tied with nationalism.”* Both sides of 
early feminism also appear in my data, which I recognised while cleaning the 
material for the analysis. However, the digital analysis also revealed different 
nuances in the dynamics between national and transnational aspects of early 
Finnish feminism, as will be shown in the last section of this chapter. 


How the Analysis Was Made 


The project was started by preparing the dataset of 50 texts for the analysis, after 
which I conducted topic modelling with 10, 15 and 20 topics. At first, the topics 
produced by MALLET seemed like a foreign language to me and the fact that 
every round of analysis could bring different kinds of word lists was puzzling. 
Even though I was mechanically able to make the right commands, the ability 
to start the analysis required a new way of interpreting the lists produced by the 
computer. This I could not have done without the guidance of Mila Oiva, who 
patiently used her own research as an example to walk me through the process 
of shifting my way of thinking. Learning a new way of interpreting the word 
lists was not the only challenge: I also had some problems with stop-words. 
Some of them kept popping up in the topics even though I had added them to 
the list. However, the final round of topic modelling offered satisfactory clean 
topics even if they still contained some of the listed stop-words, such as the for- 
eign words ‘del, ‘und’ and ‘des, as well as abbreviations such as ‘klo. Because the 
pool of data was small, I chose to make the analysis based on 10 topics, which 
brought the clearest image of the data (see Table 9.1). 

Three out of 10 topics pointed out to transnational exchange of ideas with 
words that referred to foreign countries in general or by name and to nation- 
alities or countries in plural. However, there were differences between these 
three topics. While the first one clearly referred to international connections in 
media in forms of news reports, the second one attached internationalism to 
the past of the women’s movement and the third topic connected international 
connections and women’s movement congresses, pointing to the transnational 
nature of the women’s movement. Other topics were clearly national by nature, 
but nationalism became a marker for only one of them which included the 
word ‘isénmaan’ (nation’s), for instance. 

Otherwise, the topics emphasised meetings of various womens organisa- 
tions and particular individuals such as Maikki Friberg. Four of the topics 
include words referring to men. Two of them seem to point to the negotiation 
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Table 9.1: Topic modelling with 10 topics. 


saawat warsin rahaston walittiin prior 
naisasia hywäksi woi des anne 
tiedekunnassa und erityinen 


Topic | Keywords Themes 

0 | kongressin ihmiset del ulkomailla suoma- | kansainvälisyys, kongressit, 
laiset teki sisältää ranskan lehdet pitäen uutisointi 
prete otettiin tavoin kansainliiton merkitsee | internationality, congresses, 
helsingissä tahi monessa aihetta italialaista | news coverage 

1 suomen osaston naisliiton lucina tohtori naisasiajärjestöt, kokoukset, 
puhui helsingin naisten ohjelma maikki raportointi 
friberg liiton olga lausui opettaja esitti womens organisations, 
hagmanin kuulla alkoi unionin meetings, reporting 

2 puhuja tehdä elämän naisen lapset suurta kokoukset, nainen, roolit, 
mies elämä piti yleinen miehet tehtävä mies, lapsi 
määrässä äiti olosuhteet elää nuori sai meetings, a woman, roles, a 
alustaja tehty man, a child 

3 naisten naiset saa saada työtä naisia miesten | nainen, roolit, julkinen elämä 
kodin osaa yhteiskunnan osa naisille pois a woman, roles, public life 
lopuksi maan saanut ulkopuolella 
muutamia olemassa nähden 

4  |owat nainen naisen naisasialiikkeen maissa | kansainvälisyys, ylirajaisuus, 
naisasialiike oliwat syntynyt toiminnan historia 
joukko wuotta naisyhdistyksen asema internationality, transna- 
eiwät työn maassa osasto omasta toimintaa tional, history 
kehityksen 

5 naisten suomen kotitalouden liiton hyväksi | kokoukset, kansainvälisyys, 
klo alalla esitelmä seurasi suomessa naisasiajärjestöt 
nykyjään kokous saksan liitto ohjelmassa meetings, internationality, 
kaikissa saapunut kansallisliiton suomi womens organisations 
esitelman 

6 mies professori miehen nim nainen naista mies, arvio, naisen rooli 
naisen voinut esittää perheessä mielestä a man, review, a womans role 
arvoa ensinkään prof suhteessa tunnettu 
pitää olevien määrin käy 

7 ibsenin tuli ibsen lapsia väkijuomien Ibsen, raittius, mies, naisen 
tyttöjen naimisiin perintönä isä ammatin rooli, äitiys 
runouden paloviinan lapsen jokaisen Ibsen, temperance, a man, a 
ominaisuudet vanha valtiopäivillä saivat womans role, motherhood 
selville vieläpä 

8 suomen kansan lasten laki maamme oikeus | naisten asema, äitiys, 
eduskunnassa rouva äänioikeus pitäisi kansallisuusaate 
asioissa itselleen lain yleisesti miehen womens status, motherhood, 
kansamme toimintaan isänmaan j.n.e tietä | nationalism 

9  |warten dagmar von die hywin sai wiime naisasia, keskeiset henkilöt 


womens issue, main persons 


Source: Author. 
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between women’s and men’s roles, while others bring men’s point of view to 
the women’s question to the centre. This can be explained with the variety of 
writers and their relation to the women’s movement. Texts that appeared in the 
regular column of the Women’s Feminist Union were most probably written by 
women’s movement activists themselves, as were some other texts published in 
the paper. However, other texts presented opinions of prominent men. Mother- 
hood is part of three topics, whereas temperance appears to be part of only one 
topic. The emphasis on women’s maternal roles seems accurate because, during 
the early 20th century, bourgeois women argued on behalf of a social moth- 
erhood locating motherhood as women's most important task in the society. 
According to Sulkunen and others, this bipartisan citizenship was supported 
by a majority of Finnish women by the 1920s.? On the contrary, the marginal 
role of temperance within the topics is slightly surprising. 

What do the above-mentioned outlines tell us about the data? First, as an 
unexperienced user of MALLET, as pointed out earlier, I began the process by 
staring at the list of keywords offered by the software without really knowing 
how I could interpret them. Due to this, I used the scattered ideas I had gained 
of the texts while carrying out the above-presented classification, even though I 
had not read all the texts with a similar intensity. Thus, a combination of digital 
analysis and a close reading of the material helped me to pinpoint the topics 
that might have otherwise remained unnoticed.? Second, the strength of the 
digital analysis is in its manner of presenting the data in a different form, which 
highlights certain patterns. In this case, it is evident that the discussion on the 
womens movement was conducted as part of the Finnish public sphere and 
nationally topical issues. Simultaneously, foreign countries served as a standard 
reference point and the women's movement appeared as a transnational phe- 
nomenon even though it was connected to national discussions. Third, the 
public discussion offered room also for men to define their stance towards 
the womens issue. Fourth, rather surprisingly, certain milestones in the 
development of womens status were not connected to the women's movement 
in the public debate. For instance, themes such as womens suffrage (1906) and 
the New Marriage Law (1929) did not raise discussion in which the word naisa- 
sialiike had been used. 

Thus, it is evident that digitised material has the potential to show us surpris- 
ing results, features that we don't expect to find from the material, as Mold and 
Berridge have pointed out.* However, to be able to understand these results 
more profoundly, they need to be contextualised both thematically and jour- 
nalistically. That is to say that computer-assisted analysis also needs a human 
to contextualise the results (an example of which are media texts that should 
not be seen as a number of separate articles, but instead as part of the publica- 
tion context of their time).? For instance, the length of articles had great vari- 
ation at the turn of the 19th and 20th centuries. Päivälehti / Helsingin Sanomat 
published short news and reports as well as extensively long congress reports, 
which were often several pages in length. Potentially, this affects the results as 
I assume has been the case with the discussion on temperance. Based on the 
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topic modelling, temperance had a relatively minor role in the public discus- 
sion, but based on close reading of the data it was a recurring theme within the 
material. However, it was not discussed as part of the longer articles, but only 
mentioned briefly in other texts. Elsewhere, I have also argued the need to bear 
editorial practices in mind while historians use media texts to make interpreta- 
tions of past phenomena. This is particularly important while using digitised 
materials that easily shadow the journalistic processes behind the texts by tak- 
ing them out of the context.? In my opinion, contextualising may also form 
the bridge that brings digital humanities and feminist history closer together, 
moving them from being strange bedfellows to being a functional pair. 


Conclusion 


In this self-reflexive chapter, I have discussed my own road to digital humani- 
ties, a journey which has actually only just begun. I believe that my reflections 
correlate with those of many of my fellow historians and other humanists who 
have started their scholarly work before the increasing digitisation of the soci- 
ety and are now trying to figure out how the digital methods can be used. What 
did I learn when conducting my small-scale case study? 

Based on my experience, I agree with Solberg, who has argued that digital 
environments create ‘new ways of interacting with’ the material.” I would like 
to add that, at least for scholars with limited digital skills, they offer an oppor- 
tunity to conduct the research in a more self-aware manner, when every step of 
the process needs more thought than a traditional research day working with 
paper archives, for instance. For a feminist scholar, digital humanities may also 
serve as a channel for emancipation if the scholar chooses to actively partici- 
pate in the process of analysis instead of relying on the results produced by IT 
support. However, to be able to do this, we need the support from our univer- 
sities to focus on this kind of a large-scale project that also requires time for 
learning new skills. 

My experience clearly demonstrates that conducting a basic digital analysis 
is possible even for a beginner if she receives sufficient support to carry it out. 
Additionally, the practice is the best way to increase one's understanding of dig- 
ital analysis. When the understanding increases, the research questions become 
more accurate at the same time. Within the limits of this small-scale project, 
the results were not mind-blowing, but they merely strengthened the results 
of other scholars focusing on the intertwined relation between national and 
transnational in the history of early feminism. However, the data also reveals 
new and previously unresearched questions, such as the use and development 
of vocabulary relating to women’s issues in Finland. Furthermore, results of big 
data analysis expose new ways of perceiving the material which may revolu- 
tionise gender history by revealing gender in places that previous research has 
been unable to grasp. By also challenging feminist scholars to take a step back 


Towards Digital Histories of Women's Suffrage Movements 161 


and examine the material from a distance, digital humanities has the potential 
to change our understanding of gendered patterns in the past. 

These questions require more sophisticated digital analysis than has been 
possible in this chapter, but it is an inspiring direction towards which I hope to 
be able to move, alongside other feminist historians in the future. One way 
to do this is to develop grassroots digitisation projects in which gender, race 
and other marginalising factors could be taken into account when selecting 
the objects of digitisation. These kinds of projects have the possibility of devel- 
oping the field by producing more localised and situated data collections that 
challenge the history we are writing and offer a broader participation in digital 
history work also for those with basic technological skills. 
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CHAPTER 10 


Of Great Men and Eurovision Songs 


Studying the Finnish Audio-Visual Heritage 
through NER-based Analysis on Metadata! 


Maiju Kannisto and Pekka Kauppinen 


Introduction 


Part of a nation’s cultural heritage is produced and preserved through audio- 
visual archives. Whatever has made it into the archival collection has passed 
severe processes of selection, which secure for certain cultural products a place 
in the cultural memory of a society. This process is called canonisation.? In 
this chapter, we explore how the canon is built up in the national audio-visual 
archive in the digital age. We study this by asking which people and what events 
and periods are ‘remembered’ not by the members of a nation, but by a collec- 
tive national memory resource, the online audio-visual archive of a national 
public service broadcaster. Like particular natural and historic sites consid- 
ered as ‘heritage, and thus referring to a cultural and historical resource for all 
generations, past radio and television is now protected through copyright and 
continued cultural recirculation and exploited as both private and public prop- 
erty? Our study focuses on the Finnish public service broadcasting company 
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Yle (former Yleisradio, founded in 1926) and its online archive. The dataset 
used in our research consists of Yle’s archival metadata, which we analyse as 
a historical source material using the method of Named Entity Recognition 
(NER) as it is implemented in the digital tool the Finnish rule-based named- 
entity recogniser (FiNER). 

Yle’s online archive, The Living Archive (Elävä arkisto), presents part of Fin- 
land’s audio-visual history. The archive is an illuminating case study as it is 
a large editorial historical audio-visual service with open access metadata, as 
well as a high profile among Yle’s media output. It was first launched in 2006 on 
the 80th birthday of the Yle Company and the Living Archive enabled Yle to 
celebrate its history, while at the same time providing the public with new ways 
of watching television programmes online. The core idea behind the Living 
Archive is that audio-visual clips are historical source materials—documents 
from and about the national past representing Finland’s national audio-visual 
heritage. In this way, the archive contributes the nation-building process by 
formulating collective national identity, as suggested by Benedict Anderson in 
his theory of nations as imagined communities.* As media researcher Derek 
Kompare suggests, the audio-visual heritage serves as a base of legitimacy for 
audio-visual media and memory; namely, as something worthy of attention, 
preservation and tribute. It can be used as a cultural touchstone, instantly sig- 
nifying particular times.’ 

The Living Archive continually publishes new material from Yle’s archives for 
viewing and listening via the online service. What is published is based on topi- 
cality, new copyright licences, Yles current programming strategies, audience 
wishes and new archival discoveries.* Archival material has accumulated from 
the first audio recordings at the beginning of the 20th century to the present: 
Yle selects and presents the audio-visual archive material for current users in 
constantly new ways by adding and framing the material in 'background arti- 
cles; written by journalists and archivists. Media history researcher Mari Pajala 
has analysed the ways in which the Living Archive attempts to make the mate- 
rial ‘alive’ and meaningful in the present through its journalistic front page, 
background articles and the possibility for interaction. In this way, by tying in 
moments of archive television with current events and television programmes, 
the archive continually connects the present with the past. 

The aim of this chapter is to produce new knowledge on the canon built in 
the Living Archive. However, at the same time, metadata as a source material 
and name recognition as a tool in a historical study must be analysed because 
the canon is connected with the limits and possibilities offered by the digi- 
tal material and tool. In this chapter, we first introduce the research material 
and our method, the metadata used and the NER-based analysis. This analysis, 
using the FiNER tool, enables the identification of particular historical indi- 
viduals, events and years from the metadata material. Finally, we discuss the 
limits and possibilities of our research process and results. 
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Finding Voices and Images of the Finnish Past: 
Metadata and FiNER Analysis 


So far, since 2006, tens of thousands of audio-visual clips have been published 
in the Living Archive, so a computer-assisted method is needed to make larger 
sense of such a vast source material. The historical researcher' role in delimit- 
ing and selecting relevant material for analysis is crucial to the success of the 
computer-assisted analysis results. In this chapter, the research data does not 
consist of the media clips of the media files stored in the Living Archive as 
such, but of the metadata of the media information of these media files. Yle 
made this metadata material available on 3 January 2018.$ Of this metadata, we 
selected the columns describing the title of the media, the promotion title and 
the description of the content (in Finnish). These columns describe the content 
of media clips, but not in general the author information. We did not want 
to include the latter in this study as we preferred to focus on the constructed 
national heritage and the individuals presented therein—not the authors of the 
various media documents. This could, however, be a relevant topic for future 
historical research. 

The use of metadata as data is problematic in some respects because the mate- 
rial is inconsistent and contains some double data and gaps, as well as, at times, 
false information. An essential part of the historical source criticism is thus 
to find out the classification and guidelines that have been used to create the 
metadata. In the Living Archive metadata, the media descriptions have been 
produced in three main ways. (1) Descriptions have been made by the authors 
of the original programmes in connection with their original production to 
serve as programme information for other media. (2) The reporter or adminis- 
trator of the Living Archive has written a short description of the clip’s content 
in its subject field. Since this description field has not been displayed to the end 
user, this field is not always completed, and thus about one-tenth of the fields 
are empty. (3) Due to technical failure in the migration of the archival clips in 
2011 (the transfer from one technical platform to another), the media informa- 
tion on the clips’ still images were incorrectly entered in the media clips’ subject 
field.’ Therefore, in place of the description of the audio-visual content, this field 
contains the content of the still image, together with the photographer's name. 
Only the content of some video clips published before 2011 may have been 
edited and supplemented after that year. In the analysis, we have been aware of 
the above-mentioned issues and taken them into account as much as possible 
(for example, we excluded the photographers from the analysis). 

NER is a task in Information Extraction consisting in identifying and clas- 
sifying some types of information elements, called Named Entities (NE). It is 
stated that NER analysis usually responds to the five typical questions in the 
journalism domain: what, where, when, who and why." In our analysis, we 
utilised FiNER," a rule-based NER tool loosely based on the Swedish NER 
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HFST-SweNER." At the time of our study, FINER had not yet been used exten- 
sively as a research tool and thererfore this chapter is also partly a methodologi- 
cal experimental study.'* FiNER was created for the FIN-CLARIN consortium 
at the University of Helsinki, which is the Finnish part of the European CLA- 
RIN (Common Language Resources and Technology Infrastructure) collabo- 
ration for developing research infrastructure for language-related resources in 
humanities and social sciences. 

FiNER utilises the Helsinki Finite-State Toolkit and its implementation of the 
pmatch (partial string matching) function," which allows the compilation and 
implementation of pattern-matching rules as computationally efficient finite- 
state transducers (FSTs). FiNERS pattern-matching rules employ a number 
of strategies in finding and disambiguation names, including hints in string 
structure (such as uppercase letters, affixation, etc.), collocations, runtime 
adaptation and gazetteers (lists of names). The FiNER tool accepts any plain- 
text input, but works best on running text that adheres to Standard Finnish 
spelling and typographic conventions. We used a set of UNIX text-processing 
utilities to extract relevant segments from the tabular metadata for tagging, as 
well as to filter out any superfluous data that might slow down or interfere with 
the NER process. 

In order to calculate total frequencies for names in the data, each matched 
word segment in the output first had to be lemmatised (that is, reverted to its 
uninflected form). Thus, we had FiNER output lemma forms and morphologi- 
cal analyses for each word and created a Python script that extracted matched 
sequences and used morphological information to lemmatise each of them. 
Once all names had been printed out in their lemmatised forms, their frequen- 
cies in the data could be calculated with relative ease. 

As digital history researchers Kimmo Elo and Olli Kleemola have pointed 
out, it is essential for the researcher to understand how the computer-assisted 
analysis produces the results.!* Thus, we looked at the frequency lists of differ- 
ent categories of FiNER analysis from the perspective of the possibilities and 
limits set by the technical characteristics of the tool, as well as possible errors. 
After having removed the names that occurred as a result of technical error, we 
put together TOP10 and TOP20 lists from the names that have received most 
mentions in different categories. Finally, in order to understand the results, it 
was important to check the background articles in the Living Archive and to 
contextualise them tentatively with Finnish cultural and political perspectives. 


Great Men, Journalists and Musicians as Remembered Persons 


Peoples names are generally better recognised than other name entity types.'* 
FiNER recognised nearly 12,000 people in our research material; in the analysis, 
we focused on those who had received most mentions. When analysing the 
people, we found three main groups dominating the historical personage of 
the Living Archive: great men, journalists and musicians (see Table 10.1). Among 
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Table 10.1: Top 20 ofthe most frequently mentioned people in the metadata of 
the Living Archive listed by FiNER. 


Number of | Top 20 people Their professions 
mentions 
353 Urho Kekkonen (includes Urho | President of Finland 
Kekkonen and ‘Kekkoner) 
223 Arto Nyberg journalist 
125 Kirka musician 
111 Marjukka Havumäki journalist 
105 Vesa-Matti Loiri musician, actor 
90 Mauno Koivisto President of Finland 
89 Mikko Alatalo musician, journalist, politician 
89 Matti 
86 Bettina Sägbom journalist 
78 Matti Rönkä journalist 
78 Danny musician 
76 Päivi Istala journalist 
74 Göran Palm journalist 
73 Mannerheim President of Finland 
71 Paula Koivuniemi musician 
71 Jaakko Selin journalist 
69 Lasse Märtenson composer, musician 
69 Antti 
68 Susanna Ström-Wilkinson journalist 
64 Juha Laaksonen journalist 


Note: Photographers are excluded. Source: Author. 


the Top 20 of the most represented people, President Urho Kekkonen is the 
most freguently mentioned person. Kekkonen (1900-1986) was the longest- 
serving President of Finland (1956-1982) and a politician who achieved an 
almost unchallengeable political position during his long presidency. President 
Kekkonens prevalent role in the archival clips is also emphasised by the fact 
that his trusted photographer Kalle Kultala received the second most mentions 
in the material, although this is due to an incorrect overwrite in the media 
information. Kekkonen is the protagonist in many articles, which include a 
number of video and audio clips from election campaigns, presidential visits 
and public speeches, as well as clips of him carrying out his hobbies. During his 
25 years of presidency, Kekkonen gave 25 presidential New Year’s speeches and 
hosted 22 Independence Day receptions, both of which are essentially national 
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audio-visual broadcasts, repeated annually and among the most watched pro- 
grammes on Finnish television. 

President Kekkonen was connected to the Finnish Broadcasting Company 
Yleisradio in many ways. First, he had close personal relations with people 
working there. Second, his political career was concurrent with the develop- 
ment of broadcasting in the 1930s and 1940s. Finally, he often appealed to 
the people in his radio speeches during the growth in radio licences. It can 
be argued that his long reign was comparable to the monopoly of Yleisradio." 
Media researchers Lotta Lounasmeri and Johanna Sumiala have argued that 
President Kekkonen used the new mass media skillfully. He managed to com- 
bine different roles by acting as a sovereign political leader in the mass media 
while, on the other hand, at other less formal occasions, performing as a man 
of the people by skiing, fishing and meeting people in the countryside. Iconic 
images of the photogenic president were spread across the mass media, making 
these audio-visual presidential representations part of the nation’s public mem- 
ory. In this way, the Living Archive played an important but previously mainly 
neglected role in preserving and presenting a political part of the national 
audio-visual memory. Furthermore, President Kekkonen not only appears in 
the archival clips, but also as a character in documentaries, drama series and 
sketch shows, and as a reference point to which other politicians were reacting. 
This is not just a past but also a very active part of the national memory, as the 
cult of President Kekkonen still strongly survives as a nostalgia and longing for 
strong leadership.” 

Among the Top 20 of the most mentioned people, there was also another 
President of Finland, Carl Gustaf Emil Mannerheim (1867-1951), who could 
be characterised as a great man in national history. Mannerheim, Marshal of 
Finland, was frequently mentioned, even though he appears in few contem- 
porary audio-visual sources. However, he is retrospectively presented in the 
archive through later historical photographs, film clips, and radio speeches and 
documents. Furthermore, Mannerheim’s character became a part of Finnish 
popular culture in the form of a controversial doll-animated short film in 2008 
and through other films. According to historian Tuomas Tepora, the contra- 
dicting views have been an essential part of Mannerheim’s mythical role in 
Finnish cultural memory.” His legacy also appears in the material when talking 
of The Mannerheim League for Child Welfare and the Knights of the Manner- 
heim Cross, awarded to Finnish soldiers. 

Along with the great men, men in general dominate the media clips: 72 of 
the first 100 people are men. Of the 10 women mentioned, the top eight are 
Yle journalists; only singer Paula Koivuniemi and President Tarja Halonen 
are also cited. As is well known, in the Western historiographic tradition, 
women have rarely been treated as active actors until recent years and therefore 
they have not been given much space in national historiography. Although the 
tradition has been challenged for almost 50 years now, women are still under- 
represented in most canonised historical narratives.” This is also the case in the 
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Living Archive, in spite of eguality work being in operation for a long time now. 
This is also related to the history of the feminisation of Finnish professions. 
President Tarja Halonen was the first female Finnish President (2000-2012). 
This can also be observed among the journalism profession, another dominant 
group in the archive. At Yleisradio in 1972, only 35% of the journalists were 
women,” and the number of female journalists only increased significantly 
from the 1970s onwards. 

In general, the journalists of Yle to a great extent appear in the material. The 
Top 20 listing includes eight Yle journalists (see Table 10.1). In the metadata, 
journalists are mentioned in many respects. News anchors are mentioned in 
connection with the archival news clips, while journalists working on the Living 
Archive are mentioned in connection with the background articles. Journal- 
ist Arto Nyberg is the second most freguently cited person in the archive. 
He has since 2004 hosted a popular talk show named after him, where many 
Finnish public figures and celebrities have visited and many clips therefrom 
are published in the Living Archive. In this audio-visual history, journalists are 
the key figures and agents behind the production of the media clips and thus 
appear freguently in the material more in this capacity as facilitators and pro- 
ducers, rather than as objects of newsworthy events. This important factor of 
the media production system needs to be taken into account. 

In addition to journalists and presidents, many musicians are also mentioned 
in the audio-visual archive: six are included in the Top 20 listing. What unites 
these musicians is great popularity, which often comprises long careers stretch- 
ing to several decades of hit songs. When compared to the most freguently 
mentioned bands listed by FiNER, the musicians are more focused on older 
pop singers and evergreen hits, while the bands category is more versatile and 
focuses on newer bands. The most freguently mentioned artists are those who 
have had visibility in Yles music programmes, such as Eurovision Song Con- 
tests and festival recordings. For example, Mikko Alatalo’s prominence in the 
archive emanates from his versatile TV work in Yles music programmes in 
the 1970s and 1980s. 


Eurovision Song Contests, Sports and Wars Unite the Nation 


Music also plays an important role in FiNER’s list of the most frequent events. 
In part, this is due to Yles major role in recording music festivals and pre- 
serving them for the audio-visual heritage in the Living Archive. However, the 
largest national audio-visual event is not a national but a European event in 
the form of the Eurovision Song Contest. As an annual and long-lasting tel- 
evision event, the song contest has been an important event of popular cul- 
ture. Historian Mari Pajala has argued that the Eurovision Song Contests have 
become a prominent part of Finnish national memory and history. The regular 
annual contests have particularly participated in Finnish nation-making by 
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creating media discussions surrounding the meaning of nationality in the col- 
lective cultural memory. The Living Archive reviews both the successes and 
the failures of Eurovision. Finland has historically been very unsuccessful in 
the contests, but this finally changed in 2006 with the hard rock band Lordi's 
victory. The long tradition of negative experiences and the decades of disap- 
pointment were then forgotten as public festivals reupted in market squares all 
over Finland.” In addition to the international contest, the national Eurovision 
qualifiers are presented extensively in the archive, which gives many artists a 
place in the national audio-visual history. 

Another popular category of events is major sports competitions, such as 
the European and World Championships in ice hockey and track and field. The 
broadcasting of these major sport events has been strategically significant to 
the Finnish Broadcasting Company since they attract large audiences and 
are therefore legitimising the company’s role as a public service broadcaster. 

The third category relates to war. Wars and political conflicts identified by 
FiNER are listed as events, together with music, sport, art and entertainment, 
which can, however, be analytically difficult for comparisons. Among the wars, 
the Finnish Winter and Continuation Wars are particularly well represented 
in the media material. These wars are significant episodes in Finnish history 
and form part of the national mythology formulated through national public- 
ity, as well as, for example, through their appearance in schoolbooks.? There- 
fore, these two wars also form an important part of Finnish cultural memory.” 
The Living Archive preserves the remembrance of wars in connection with 
different anniversaries. 

According to Pajala, many of the traditional moments of the television year 
are explicitly related to nationality: Eurovision Song Contests and sport events 
unite the national public in excitement at the presence of their own representa- 
tives, while the states sovereignty is celebrated at the President's Independence 
Day Reception, and in the national war films.” These national audio-visual 
events are accumulated and added to the Living Archive annually. 


Nationally Significant Periods as Recognised by Yle 


Different periods from the 1960s to the 2010s are remembered relatively evenly 
in the Living Archive, even though most of the programmes have been pre- 
served only in the last two decades (see Figure 10.1). However, it is impor- 
tant to note that in FiNERS analysis, the recognition of the decade numbers 
is a challenge: FiNER may not recognise these without using clues from the 
text contexts, and may instead interpret them as ordinary words rather than as 
markers for historical periods. This produces gaps in the time series. There are 
three peak years in the FiNER frequency list: 1976, 1995 and 2008. When trac- 
ing these years from the media material, we noticed that during these periods 
there were an extraordinary number of significant political and cultural events. 
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Figure 10.1: The occurrence frequencies of the years 1917-2017 in the time 
series identified by FiNER from Yle’s metadata material. Note: The peak years 
1976, 1995 and 2008 have been circled. Since the summer of 2015, all the pro- 
grammes of the Living Archive have been published via the Areena platform, 
which causes a harsh decrease in the metadata material. Source: Author. 


We suggest that these events have become key experiences determining the 
past and are thus significant parts of the national memory.” These historical 
events are presented in the background articles and collectively remembered 
in the Living Archive. 

Among the 1976 media clips, the long-distance runner Lasse Virén's wins in 
the 5,000- and 10,000-metre runs at the Montreal Olympics and the drama series 
Myrskyluodon Maija (Maija of Myrskyluoto), shown on Yles main channel, are 
both highlighted as key collective experiences. The latter television series was 
set in the Finnish archipelago during the 19th century and pictured the life of 
the protagonist Maija and her family in six episodes. Lasse Martenson, who was 
among the Top 10 in FiNERS list of the most represented people, composed 
the music for the series. The series has been characterised as unforgettable by 
many, in large part due to the music. In actual fact, the piano transcription of 
the theme music is the best-selling Finnish music publication of all time.? The 
Living Archive contains several background articles about both the music and 
new versions of the songs, such as the highlights of 15 versions of the theme 
music performed by various artists.? As evidenced by its many occurrences 
in our source material, the melancholic composition apparently succeeded in 
touching the collective Finnish psyche and therefore in becoming a prominent 
part of the national audio-visual heritage. 
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In 1995, two main events dominate the media material. The first was that Fin- 
land joined the European Union. This was the topic of a large number of media 
clips, such as news and current affairs programmes, as well as sketches, and 
there are many background articles belonging to the EU thematic. The second 
event was winning the Ice Hockey World Championship, which was and has 
continued to be a key national experience. Ice hockey is Finland's largest sport 
in terms of media visibility and every spring the Ice Hockey World Champion- 
ships attract a large national audience. The televised winning final in 1995 was 
watched by 4696 of all Finns. Cultural historian Hannu Salmi has connected 
the boom of ice hockey to changes within the Finnish media landscape. In the 
1980s, when the national team began to enjoy success, ice hockey became more 
visibile and television rights became the subject of a struggle between pub- 
lic service and commercial media companies.* The victory in 1995 was the 
first World Championship win for any Finnish sports team and the nation was 
united in celebrating the victory together. In public, the victory was made into 
a question of national self-esteem.” Therefore, remembering the great victory 
also has an important role in the Living Archive. 

During the third peak year of 2008, the key experiences are connected to the 
political elections, which received significant media attention. In Finland, the 
Finns Party (perussuomalaiset, formerly the True Finns Party) was the biggest 
winner in the municipal election and brought forward a strong criticism of 
the existing immigration policy in public discussion. Another notable politi- 
cal event that contributed to the 2008 peak was an event occurring outside 
Finland, when, in the United States, Barack Obama was elected President. 
However, most of the accumulated media clips during the year concerned the 
Kauhajoki school shooting, when a student shot and killed 11 people, including 
himself. After an earlier Finnish school shooting in Jokela the previous year, 
it was not possible to treat the Kauhajoki shooting as an individual incident, 
and journalists thus tried to find political and social explanations for the trag- 
edy (for example, related to gun legislation).? The Living Archive documented 
this debate. 

In addition to these events, many other 2008 media clips focused on the 30th 
anniversary of the Finnish rock festival Provinssirock. Even if many key experi- 
ences are connected with Finnish or world history, the audio-visual nature of 
the archive places a particular emphasis on audio-visual events. These events 
emphasise the particular role of Yle—for example, regarding television series, 
the Eurovision Song Contest, sports and footage of music festivals. Therefore, the 
national audio-visual heritage is Yles particular heritage and its particular 
contribution to Finnish history; Yle owns a large proportion of the Finnish 
audio-visual heritage. Commercial media companies, like MTV founded in 
1957, have no similar archives and there are no programmes left from the 
early decades of the independent Finland. During the early years of television, 
programmes reached large audiences, allowing programmes to create iconic 
images and become part of the national public memory.** However, in the 
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Living Archive, commercial television and radio are left aside, rendering it an 
important resource for the public service company Yle in the struggle among 
the various media to legitimise its cultural position. 


Finally: The Limits and Possibilities of Interpretation 


Digital archives are organic entities that grow and change their shape as new 
materials are added. The strategy of Yles Living Archive is to grow constantly, 
which produces a cyclical nature for the production of this national audio- 
visual heritage. Annually recurring events related to the audio-visual culture 
have generated the most archival material over the years. In this way, Yle has 
created a national audio-visual annua] calendar to commemorate significant 
moments, like Independence Day. Another strategy of the Living Archive, 
tackling contemporary topical issues, in turn, binds the past to this present 
moment. The archive actively follows current topics and events. This strategy 
equals Derek Kompare’s notion that the legitimation of television, as in the case 
of archive, was based not on the existing canon, as it had been with film, but 
on the idea of heritage. While the canon tends to be separated from everyday 
life and located in a refined, timeless sanctuary, a heritage is part of the lived, 
historical experience of a culture.” Both of Yle's strategies have succeeded in 
producing the most popular content in the archive, such as responses to topical 
issues and death notifications for public figures. 

It is important to note that the choices of which media clips are published in 
the Living Archive and their output are political choices that remain invisible to 
archive users. As Pajala has pointed out, the archive has limited possibilities 
to publish anything not shown by public service television. As a result, for- 
mer legislation and cultural norms continue to restrict today's debate on the 
subject of audio-visual archives.” In addition, the copyright agreements sig- 
nificantly shape the publication choices, as not all of Yle's archive material can 
be published. During the first years, the only materials published were Yles 
own journalistic programmes and films redeemed from old film companies. 
Another restriction is that all released music has to be reported to copyright 
organisations, but since 2015, a separate agreement has made it possible to pub- 
lish many old music programmes, such as the Finnish gualification competi- 
tions for the Eurovision Song Contest and the music programme Hittimittari 
from the 1980s. A drama agreement in 2016 has also enabled Yle to release 
drama series and films, with long publishing rights. However, this is not only 
about a guestion of choice, since not all of the old material is actually available. 
Only since 1984 has all self-produced TV material been archived. In addition, 
the old programmes may have very poor metadata, making them more difficult 
to locate.“ 

In addition to the choice of data, we also have to take into account FiNER'S 
specific internal limitations, which are similar to those of other rule-based 
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NER systems: any shortcomings in rule formulation or gazetteers result in false 
positives and misclassification of matched names, which if no rule applies, 
can make a name go altogether unnoticed. Single-word names of organisa- 
tions and events are particularly difficult to identify without context clues or 
structural hints. This is particularly true for short texts such as the metadata 
entries, where the reader is often assumed to be familiar with names and their 
referents. In the case of rule-based NER systems, any misses and tagging errors 
are human in nature as they arguably reflect the system developer's ability or 
inability to formulate exhaustive rules, as well as their oversights when building 
the gazetteers. 

Metadata as a research material also raises wider ontological questions of 
historical knowledge: What can we really know about the audio-visual cul- 
tural heritage on the grounds of metadata and FiNER analysis? Digital history 
researcher Michael J. Kramer discusses the relationship between historians and 
digital archives and describes history as a meta-metadata.*! There are so many 
layers of interpretation: metadata is not merely descriptive, but is also already 
an interpretation of an archivist. Another issue stems from the fact that the 
results of the FiNER analysis are the lists of frequencies of names. These fre- 
quencies are not directly related to media material: FiNER as a tool has its own 
limitations and rules on how it interprets and categorises the metadata that 
has been pre-processed by the researcher, something which also brings its own 
specific limitations. Finally, the historical researcher interprets the results of 
FiNER. We suggest that it is actually already a meta-meta-metadata of the tar- 
get of study. In order to understand all these layers of interpretation, we need 
collaboration between archivists and historians to make visible the guidelines 
and ways of writing metadata. The most fruitful approach would be for them to 
cooperate in negotiating how metadata best serve both parties. 

A historian also needs to be aware of the functional logic of the digital tool 
for being able to recognise the bumps on the road of interpretation." Our NER- 
based analysis revealed an interesting emphasis on people, events and years 
in the audio-visual heritage constructed by the Living Archive. However, the 
FiNER analysis did not shed light on why and how these certain topics are rep- 
resented in the archive. Instead, what the analysis did show is why audio-visual 
heritage is constructed in a certain way. 

Metadata, like Yles metadata, is often messy and requires a significant 
amount of selection and pre-processing. In addition, the metadata material, as 
well as the digital tool, has limitations, as described above.? The large volume 
of data can compensate for some of the technical limitations. However, it is 
important to acknowledge that, in the humanities, the shift from small smart 
data to big data is not just technological; in fact, it seems to be even more of a 
methodological shift. Methodologically, it means the shift from close reading 
to distant reading. In this paradigm, instead of reading a few selected texts, we 
can analyse an entire collection of relevant textual data. However, the cultural 
contextualising and close reading of the themes pointed out by the results of 
NER-based analysis still play an important role in the analytical process. Only 
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after the cultural contextualising and close reading do the word lists come alive, 
so they can animate the role of the audio-visual heritage in the construction of 
the Finnish imagined national community broadcast on radio and television. 
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Notes 


' An earlier, Finnish language version of this research entitled Kekkonen, 
Euroviisut ja Helsinki —kansallinen audiovisuaalinen perintö NER-analyysin 
tunnistamana was published in the journal Ennen ja nyt (2/2019). This 
edited version is published with the permission of the journal Ennen ja nyt. 

* Assmann 2008: 100. 

? Kompare 2005: 103. 

^ Anderson 2007/1983. 

* Kompare 2005: 102-103. 

* Yli-Ojanperä 2018. 

7 Pajala 2010: 134, 142. 

* Material is in the form of a CSV file, which is 18 megabytes in size and 
contains 34,816 entities. 

? Elina Yli-Ojanperás interview 23 May 2018. 

10 Marrero et al. 2013: 484. 

" [n this chapter, we use a work version between two publications, which, 
however, is very similar to the version that has been published at the end of 
the year 2018. See http://urn.fi/urn:nbn:fi:lb-2018091301. 

? Kokkinakis et al. 2014. 

13 However, see Kettunen et al. 2017. 

1 Karttunen 2011. 

15 Elo & Kleemola 2016: 154. 

16 van Hooland et al. 2015: 13. 

17 Puro 2016: 23-24. 

35 Lounasmeri & Sumiala 2016: 3-4. 

9 On the cult of President Kekkonen and his popular cultural image, see 
Kallioniemi, Karki & Mahka 2016. 

? Tepora 2015b: 8-11. 

^ Halldórsdóttir, Kinnunen & Leskelä-Kärki 2016: 19. 

? Kurvinen 2013: 68. 

? Pajala 2006: 351-368. 
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% Torsti 2012: 135-155; Ahonen 2017: 10-11, 92-95. 

23 Tepora 2015a. 

?* Pajala 2006: 24; Pajala 2012. 

7 Cf. Sumiala-Seppänen 2007: 283; Torsti 2012: 63-73. On collective 
memory, see Benton 2010: 1, 5. 

% Elina Yli-Ojanperä's interview 20 June 2019. 

? Helsingius 2010. 

?' Yle Elävä arkisto (Living Archive) 2015. 

31 Salmi 2015: 13, 32-46. 

? Kannisto 2015: 95-96. 

33 Raittila et al. 2009: 112. 

= Pajala 2010: 135. 

35 Jarlbrink & Snickars 2017: 1240. 

% Kompare 2005: 105. 

9 Yli-Ojanperä 2018. 

3 Cf. Jarlbrink & Snickars 2017: 1229. 

32 Pajala 2010: 140. 

^ Yli-Ojanperä 2018. 

^! Kramer 2014. 

? See also Elo & Kleemola 2016: 154. 

5 See also Kannisto 2016. 

^ Schóch 2013. 
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CHAPTER II 


Tracing the Emergence of Nordic 
Allemansrátten through Digitised 
Parliamentary Sources 


Matti La Mela 


Introduction 


Allemansratten, a right of public access to nature, is an integral part of the iden- 
tity and lifestyle of the people in the Nordic countries. This principle, which 
is commonly seen as an age-old tradition, allows everyone to access and use 
resources in the wild even without the landowner’s consent. Despite its major 
role in contemporary Nordic societies, the roots and the development of this 
institution are little researched and not well known. This chapter contributes to 
the historical revision of allemansrátten by studying public uses of the concept 
in Finland in the 20th century. Such a broad study is possible using the recently 
digitised documents of the Finnish Parliament, which offer a unique view on 
how central societal concepts have been defined and used in public discussion. 
The chapter asks how and when allemansrätten actually emerged as a term, and 
to which discursive environments the concept was tied in the public debates of 
the 20th century. 
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The aim of the chapter is to challenge the common view of allemansrät- 
ten as an age-old and stable tradition, and to demonstrate how the concept is 
historically constructed and has been flexibly used as part of different political 
discourses. In particular, the chapter focuses on the principle of universality 
inherent to the modern allemansrátten. Even though it is acknowledged today 
that allemansrätten took its form only with the modern processes of urbanisa- 
tion, growth of free time, and development of new ways of recreation in nature 
after the 1930s, many authors build continuity to pre-modern Nordic legal cul- 
ture or access practices.! On the other hand, some critics have proposed that 
allemansrätten was actually an ideological move to socialise private land to the 
use of everybody? Most recently, however, a mid-way has been sought, where 
the past and modern cultures of access to nature are discussed separately, to 
emphasise differences in the social contexts and to demonstrate the parliamen- 
tary political support given to allemansrätten in the 20th century? 

The chapter uses methods of text mining, and uses the Finnish parliamen- 
tary documents as its source. This digitised data has been created only very 
recently, and is discussed therefore rather thoroughly in the second section 
of the chapter. It is notable, however, that other digitised sources, for instance 
newspapers, have already been used for studying parliamentary debates.* The 
digitised newspapers are currently available only until 1929. The new digit- 
ised parliamentary documents, therefore, are not only an important dataset for 
studying policy and law-making, but also offer an important perspective over 
the broader public debate after 1929, which no other complete digital collection 
in Finland currently represents. The digitised parliamentary sources of other 
countries have been used in historical and social scientific research. These 
include straightforward debate analysis through keyword searches, but also the 
use of more complicated methods such as sentiment analysis. 

This chapter studies the history ofthe uses of allemansrätten in the parliamen- 
tary data in two steps. First, in the section entitled ‘Allemansratten Emerges, 
below, keyword searches and frequency analysis is used for discovering the 
general trajectory of the term in the complete dataset from 1907 until 2000. 
The parliamentary debates are understood as reflecting common language use 
of the time, and thus reflecting topics which were central for the contempo- 
rary public discussion. Second, in the section entitled 'Allemansrátten since the 
1970s, below, the aim is to study differences in the discursive environments 
where allemansrátten was used. Key parliamentary debates are identified, and 
the parliamentary data is text-mined into two detailed debate corpora from the 
1970s and 1990s. These two corpora are analysed with collocation analysis and 
topic modelling, and the results are contrasted with each other. The hypoth- 
esis of the chapter is that the term appears only after the Second World War, 
which has been preliminarily confirmed from sporadic sources. However, 
in which discussions did this take place, and how was the modern and com- 
monly acknowledged concept appropriated in the public discussion? 
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Digitised Parliamentary Talk of the 20th Century: 
Quantity and Quality 


Parliamentary documents are a classic source in history. They are used to study 
the past national legislative work, but also offer a broader view on the political 
culture and political language, as well as major societal issues of the time.’ Inter- 
nationally, parliamentary sources have been digitised only in the past 10 years 
and are now available for research use.* In the Finnish case, the parliamentary 
documents of the latest decades have been available digitally for several years; 
however, the digitisation of the documents of the unicameral parliament from 
1907 until 2000 were finished only recently: the online digital collection was 
inaugurated in September 2018.’ 

The digital collection includes the documents of the Finnish Eduskunta, uni- 
cameral Finnish Parliament, which convened for the first time in 1907, when 
the country still formed an autonomous Grand Duchy as part of the Russian 
Empire. The parliamentary reform leading to Eduskunta was radical at its time, 
as it expanded the suffrage to the whole of the male and female population 
and, first in the world, enabled women to stand for election. From a more 
concrete point of view, however, there was important institutional continuity, 
as the national and local representative rights had already been exercised since 
the 1860s. The national Diet or Assembly of Estates, which gathered the rep- 
resentatives of the four estates for legislative work, had convened in 1809 and 
after that regularly since 1863. As Pekonen has shown, foreign parliamentary 
practices were carefully studied in the 19th century and the basis for the parlia- 
mentary procedure was established, for example, regarding minute keeping." 

The digitised collection comprises the printed volumes that have been com- 
piled during the parliamentary season (ranging usually from February to Janu- 
ary).'' The parliamentary sources contain both static documents, which were 
the basis of the parliamentary work or produced in the legislative process, and 
dynamic minute keeping, which recorded the speeches and the procedure of 
the sessions. It is important to note that even though the minutes were recorded 
in detail and directly, they have gone through minor editing in the transcrip- 
tion process at the Records Office (for example, regarding the use of dialects).'? 

The parliamentary documents follow until 1975 the publication practices 
established already in the 19th century, according to which the documents 
were grouped together per legislative case (minutes proceed chronologically). 
The materials of the season were published both in Finnish and in Swedish, 
of which the Finnish collection forms the complete collection, and Swedish 
texts include translations of the main documents and a summary of the min- 
utes. The annual publications consist of two to four volumes of the Minutes of 
the parliamentary sessions (Pöytäkirjat), one volume of the Swedish summary 
(Protokoll i sammandrag), three to five volumes of the Documents in each lan- 
guage (Asiakirjat, Handlingar) and the Annexes (Liitteet, only in Finnish).? For 
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the season 1975/II (which started in September), the Documents and Annexes 
began to be regrouped in Documents series (A-F in Finnish, A-D in Swedish) 
according to their type. Besides this, a separate Index (Hakemisto, Register) was 
published, which had been included prior to 1975/II in the last volume of the 
Minutes." The index of the Minutes has been extracted during the digitisation 
into a separate file for the seasons 1948 to 1975/I. 

The digitised material has been published online by the Finnish Parliament.’ 
In the digitisation process, the separate volumes have been scanned, optical- 
character recognised (OCR) and stored as PDF/A files. The online interface 
allows keyword searching of the text content of the pdf files, which, however, 
has considerable challenges due to errors and typographical features included 
in the OCR output.'^ The separate pdf files (the different printed volumes) can 
be downloaded and used separately. This complete dataset consists of 92.4 GB 
of optically recognised pdfs, of which 61.6 GB are in Finnish and 30.8 GB are 
in Swedish. The pdf files have been named according to their publication types 
(described above) and publication year, but the dataset does not include any 
text or metadata files. 

For this chapter, the text content of the pdfs was extracted with the pdftotxt 
tool included in the open source Xpdf." The pdf format does not offer a good 
structure for the text, and the text output was cleaned for the analysis con- 
ducted here. In ‘Allemansratten Emerges’ below, the raw documents could be 
used while they suited keyword searches and close reading of the search results. 
For the topic modelling carried out in Allemansrätten since the 1970s, below, 
two debate corpora were refined manually: only selected law cases were picked, 
and from these, only speeches by representatives were extracted and corrected 
into simple text." 

As the digitisation of the parliamentary documents has been carried out with 
printed material, the OCR quality of the material is generally very good. How- 
ever, it varies a lot and needs cleaning to be used for detailed textual analysis. 
As there is currently no previous research or evaluation of the quality of the 
material, I conducted a very rough analysis of the word recognition rates con- 
cerning the complete material per decade. I used the LAS-tool,? which has a 
functionality for word recognition rate. The recognition was conducted per 
file, and an average was calculated for document type per year. 

As shown in Figure 11.1, the recognition rates are between 6096 and around 
9596, mainly being over 8096. In comparison to the digitised Finnish newspa- 
pers, the quality is very good, as a word accuracy of around 7096 to 7596 has 
been reported for the historical newspapers.? It is notable, however, that the 
quality varies to some degree in the parliamentary material, and is lowest for 
the documents published in the 1920s and 1930s. Furthermore, the LAS-tool 
uses only one language when detecting recognition rate in the complete file. 
The Minutes in particular have a lower detection rate for the earlier years, as 
Swedish was used more commonly by the MPs. Thus, the real recognition rate 
is slightly higher than shown in the graph, but the general trend surrounding 
the quality of the digitisation is clearly visible. 
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Figure 11.1: Tokens recognised on average per year in the Finnish-language, 
digitised parliamentary documents. Source: Author. 


Allemansrätten Emerges: From Every Man to Fishing and to 
Modern Access Rights 


In this section, I will explore the emergence of allemansrätten as a commonly 
known and shared concept in the Finnish public discussion of the 20th century. 
This history is not known, and offers an important perspective on the tem- 
poral tension inherent in allemansrätten. On the one hand, allemansrätten is 
commonly narrated as an age-old tradition or as deriving from the specific 
Nordic culture. For instance, in autumn 2018, the Finnish outdoor association 
Suomen Latu launched a public campaign broadly visible in the national media, 
which aims to add allemansrätten to UNESCOS list of intangible cultural her- 
itage.? On the other hand, the scholarship has acknowledged, mainly in the 
Swedish case, that allemansrätten saw the light of day only after the 1930s, with 
the development of urbanisation and modern mass outdoor recreation. As 
Kardell writes, since its first appearances in the 1930s, allemansrätten grew to 
become part of the Swedish nation’s soul (folksjülen).?* 

How is the case with Finland, then? On a general level, it is known that also 
in Finland, the term allemansrátten, or jokamiehenoikeus in Finnish, becomes 
commonly used only after the Second World War. The parliamentary data 
allows us to study the trajectory of the term during the whole century. It also 
fills an important gap, as the digitised national newspapers are available only 
until 1929. I will first trace the appearance of the term in the parliamentary 
data, and then focus on particular instances by studying, on a sentence level, 
in which ways allemansrätten could be used in public talk. In this mapping, 
I will use the complete dataset, and in the following section of the chapter, I 
will focus on more limited corpora to study the discursive environments in 
more detail. 
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Figure 11.2: Frequencies of ‘everyman” ('jokamie*, ‘joka mieh? in Finnish, 
and 'allemans* in Swedish) per million words in the parliamentary docu- 
ments per decade. Source: Author. 


I used multiple keywords in the search in order to shed light on the differ- 
ences between the uses in the two national languages, Finnish and Swedish, and 
to capture a broader variety of terms used.?* As we can see from Figure 11.2, 
there are differences with regard to how much the terms were used. Before the 
1940s, the Swedish term allemansrätten is not used at all, and the Finnish ‘joka 
miehen' (every man's) appears to some extent. In the 1940s, there is a curious 
peak in the use of the term in Finnish. Finally, only in the 1970s, the use of 
the written forms jokamiehen (everyman's) and ‘allemans? really boom and 
become common. The other word forms are used less frequently and disappear. 

When we look at the actual uses of the terms in the parliamentary data, we 
can shape three periods. In the pre-Second World War era, the term ‘everyman’ 
referred to a common, ordinary person. During the Second World War, 'every- 
man' was used in the context of nature and outdoor activities with the intro- 
duction of the wartime 'everymans right of fish: Finally, only gradually since 
the 1960s, ‘everyman became employed in its main contemporary meaning as 
describing allemansrátten. 

The Finnish ‘joka mies, everyman or every man, is obviously a term that has 
already appeared in the language for a longer amount of time. In this meaning, 
it refers to an ordinary person or generally everybody. It is important to note 
that this Finnish term translates into the Swedish “var man’ (everyman), and 
does not bear similarity to the Swedish-language word root alleman pointing at 
allmän (common, general) and allmänningar (the commons). In general, then, 
the origins of the Finnish vocabulary are less bound to land ownership and 
social conflict." Moreover, in the early century, the term was not yet used in the 
context of outdoor activities or access to nature. We find manuals, guidebooks 
and even magazines for the ‘everyman (for example, the 'Everymans—and 
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every woman's—weekly' first published in 1907, later called the “Everyman's 
weekly"). In the parliamentary data of the early century, this is how ‘joka mies’ 
(every man) was used. In the Minutes of 1935, national defence was supported 
by stating that the ‘defence question is a question for every man.” 

Interestingly, the first major uses of ‘everyman as explicitly concerning pub- 
lic access to nature took place in the early 1940s in relation to fishing rights. 
During the wartimes, temporary fishing rights were enacted in Finland in 1941 
to alleviate the shortage of food. At first, the rights concerned fishing in one's 
local waters, but these were extended in 1943 to allow for some years fishing for 
everyone in all parts of the country. The right concerned non-commercial fish- 
ing by the household, but allowed special permissions for professional fishers, 
who were immigrants away from their home lakes. The right became labelled 
as ‘everyman’s right to fish or ‘everyman’s fishing’ (jokamiehen kalastusoikeus, 
jokamieskalastus or, in Swedish, var mans fiskerätt). In Parliament, there was 
also criticism raised against this right; however, its temporary and exceptional 
nature was acknowledged. In 1946, in the aftermath of the war, it was noted 
how among the ‘everyman fishers’ (jokamieskalastaja) there were also many 
immigrants and locals who were in dire need of fish. It was noted by the social 
democrat MP Tuomas Bryggari how this right should not only be made an 
exception, but ‘a general law, so that every citizen would have the right to fish.” 

The discussion about ‘everyman’s fishing’ continued in the 1950s, but it was 
only in the 1960s that the ‘everyman’ was used to refer to the modern alle- 
mansrätten. However, we still find discussion about fishing and ‘everyman as 
everybody or a common person as in ‘everyman’s sports. In the parliamentary 
data, the first uses of allemansrätten is in Finnish in 1964. In his question to the 
government, the left-wing MP Kalevi Kilpi (and others) asked about how 
the future outdoor legislation would react to the question of no trespassing 
signs. Kilpi added how according to ‘custom there existed a so-called every- 
mans right to roam on another's land without the permission of the owner. 
This is the period when the term allemansrátten became very common in 
Sweden. The word appeared as part of land planning and urban nature use 
in the late 1930s, and in the early 1950s, the public in Sweden could read in the 
newspapers how allemansrätten did not really appear in Swedish law, but gave 
everyone the right to move freely in the woods since the early times. In Finland, 
sporadic appearances are found in these years, but the country seemed to fol- 
low its western neighbour only in the following decade. 

In the 1970s, then, the term allemansrätten became very widely used in the 
parliamentary data. This peak is explained also by the particular moment, while 
the new outdoor legislation was discussed in the parliament in the early 1970s. 
However, this expansion in the use of allemansrätten is also due to the term 
becoming common in the Finnish language, perhaps even a rhetorical motif 
describing public access rights in general. The peak in the uses of allemansrät- 
ten does not decrease, but stays at the same level and even increases in the fol- 
lowing decades. Moreover, the term allemansrätten becomes mainly associated 
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with the area of public access rights to nature. Already at the time, the term 
was commonly used in public and scholarly discussion in Sweden. This is an 
indication that the interpretation and demarcation of the modern concept of 
allemansrätten in Finland was influenced by the Swedish discussions, which is 
also confirmed by the legal scholarly discussion on access rights to public and 
private spaces in Finland. 

Even though the digitised newspaper collections are incomplete, it is impor- 
tant to confirm the results with the newspaper material and use it as a parallel 
dataset to control possible quality problems related to the parliamentary data. 
I conducted similar keyword searches in the Finnish National Library news- 
paper dataset until 1929, and used the Sanoma Digital Archives, which host a 
handful of major Finnish newspapers with national coverage, such as Helsingin 
Sanomat and Iltasanomat. 'The results are very similar. From 1900 until 1929, 
there are no appearances of allemansrätt* (or alle mans rätt*), and I found uses 
of ‘var man’ (everyman) similar to the Finnish-language use of ‘everybody’ and 
'common person. The Sanoma Archives' sources also demonstrate the peak in 
the 1940s with references to the 'everymans right to fish: The first uses of the 
modern term “jokamiehenoikeus’ (allemansrätten) are found in the 1960s, and 
the term becomes common in the 1970s material. The first reference from 1962 
is in a letter from a reader to the newspaper Helsingin Sanomat about the recent 
private road legislation and whether walking on private roads was permitted.” 

It seems clear, then, that the term allemansrätten became commonly used 
only after the 1960s. How stable have the modern uses been? In the following 
section, I will explore more carefully the discursive environments where the 
modern term has been used by contrasting the parliamentary debates of 
the 1970s with those of the early 1990s—a moment of Finland economic and 
political opening, for example, concerning the EC/EU membership. 


Allemansrätten since the 1970s: Mapping the Shifts in 
Discursive Environments 


Today, allemansrätten is a well-known concept which has extended outside its 
core meaning of public access rights to nature. Allemansrütten has been used in 
other areas than nature to designate the importance of public access rights—for 
example, how ‘public libraries are an everymans right. Moreover, the concept 
has been branded as something uniting the Nordic countries, but also represent- 
ing several key values of these societies, such as Nordic freedom, clean nature 
and equality. What can we learn about the uses of the concept and its expan- 
sion beyond a mere ‘right to roam by looking at the parliamentary debates? 

In this section, we investigate and contrast the uses of allemansrätten at two 
moments in time: in the early 1970s, when allemansrátten was becoming a 
commonly used concept, and in the early 1990s, when the term had become 
an irreplaceable part of discussion about public access rights to nature. This 
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is achieved by studying the co-occurrence of allemansrätten with other terms 
in the two moments in time with collocation analysis and topic modelling. As 
the focus is on public discussion, the minutes of the parliament are used in this 
section. Moreover, the uses of the Finnish term jokamiehen/oikeus only are fol- 
lowed, as the majority of the parliamentary speech was in Finnish. 

When looking at the debates of the 1970s, concordance searches? for the 
term reveal important concentrations of allemansrátten in 1973, 1974, 1976 and 
1978. They help to identify the key legislative debates in which allemansrátten 
was used and became defined. In the early 1970s, the new Outdoor Recreation 
Act was discussed and enacted in Parliament. In 1974 and 1976, the debates 
regarded the expropriation law and, in 1978, the public use of nature was a topi- 
cal part of the legislative work on the chemical treatment of forest vegetation. 

In the 1990s, several concentrations of the use of the term are found with 
concordance searches. As expected, the clusters in the 1990s are more numer- 
ous and regard a broader variety of themes. In the 1970s, 170 concordance 
hits are found, which consist of over 90 hits about the Outdoor Recreation Act 
only and of two smaller concentrations of 10 to 20 hits. In the minutes of the 
1990s, on the contrary, 401 hits are found, which are divided rather equally 
in smaller clusters appearing every year. Most annual hits (56) are found in 
the 1996 minutes. In the 1990s, allemansrätten appeared as part of debates on 
nature use and the natural environment, such as hunting and fishing laws and 
natural protection legislation. Allemansrätten was, however, taken into political 
debate also in relation to questions about Finland's international relations: the 
EC/EU membership and legislation concerning foreign ownership in Finland. 
Allemansrätten, therefore, was used in the 1990s more broadly than merely in 
the context of access rights in the natural environment. 

If we move closer to the level of text, this broadening in the uses of the term 
becomes more visible. For the comparison, I formed two equal-size debate cor- 
pora (about 250,000 characters), which were about similar legislative topics. 
Similarity of the corpora was sought to minimise the effects created by the mere 
variance in the legislative topics discussed at the two moments in time. The 
1970s debates were used as a starting point, and were contrasted with the leg- 
islative projects of the 1990s, which regarded environmental protection, fish- 
ing and recreational use of nature.** Furthermore, the corpus was cleaned for 
obtaining more accurate results for the analysis of co-occurrences: first, the 
words in the two corpora files were lemmatised with the LAS tool presented 
in ‘Digitised Parliamentary Talk of the 20th Century; above. Second, the texts 
were trimmed by removing other characters than alphabet letters and deleting 
the names of the MPs and commonly repeated phrases, such as the greetings 
addressed to the speaker at the beginning of talks. 

When looking at the collocates of ‘jokamie” (everyman*) (statistically the 
most commonly appearing terms with 'jokamie*), we find that allemansrätten 
retains a common core, but has been used to deliver rather diverse messages. 
The most frequent collocates shared by both corpora include ‘right, ‘citizen, 
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‘nature; ‘ice / lure fishing’ and Finland; but also “Nordic. When we look at dif- 
ferences in collocation (collocates that are not found in the other corpus), the 
picture becomes nuanced. There are collocates which related mainly to the dif- 
ferences in law projects discussed (even though similar debates were selected 
for the comparison), such as ‘recreational use’ or ‘chemical treatment product’ 
in the 1970s. 

There seem to be differences in relation to how internal or external the uses 
of allemansrätten have been. The collocates in the 1970s point to the discovery, 
stabilisation and internal debate about the concept. The most frequent collo- 
cates include, for instance, the terms ‘age-old; ‘property right’ / ‘property’ and 
‘landowner, and among the statistically most common one finds ‘historical, 
‘heritage’ and ‘socialise. In the 1990s, the horizon seems to be broader, and 
allemansrätten appears rather as something that is being challenged by or being 
related to the outside world. In the 19905, the statistically most common col- 
locates include ‘unique, ‘outsider’, ‘trample’ and ‘spoil’ and also terms related to 
space, such as ‘international; ‘Europe; European and ‘integration. Even though 
the 1990s corpus does not include direct debates about EC membership, the 
discussions on allemansrätten seem to generate questions related to the open- 
ing of the borders. Similar fears about the overuse or the weakening of the 
access rights to nature due to EC membership were also raised in neighbouring 
Sweden in the early 1990s. 

Finally, the differences in co-occurrences between the two corpora are stud- 
ied by using topic modelling. Topic modelling is a method where separate top- 
ics (‘patterns of tightly co-occurring terms’) are detected in the text corpora 
through probability analysis. Topic modelling has proved to be a powerful 
tool, especially when organising and classifying a large quantity of text docu- 
ments." In this section, we do not examine the different topics that are found 
in the corpus, but we study in detail the topics in which allemansrátten appears. 
The topic modelling was carried out using the MALLET tool. MALLET 
includes automatic removal of stop words, and after testing several shares of 
topics, MALLET was run to find 80 topics in both corpora.? Besides building 
the topics, MALLET produces an output file, where the words in the corpus are 
annotated by their topic number. The results of jokamie*’ (everyman*) appear- 
ing in different topics is presented in Figure 11.3. 

As illustrated very clearly in the graph above, MALLET clustered alleman- 
srätten mainly to two different topics. To a large extent, this results due to the 
functionalities of the method, as the data had been lemmatised and the written 
word forms used in the 1970s and 1990s were different: topic 49 includes the 
term ‘jokamies’ (everyman), whereas topic 76 includes the term ‘jokamiehe- 
noikeus' (allemansrätten), which, as we have seen, had become the common 
written form in the 1990s 

In addition to this, we can discern again a different discursive environment 
for the two topics and the uses of allemansrátten in the 1970s and 1990s. Topic 
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Figure 11.3: Topics in which the term jokamie" (everyman*) was grouped in 
the topic modelling (per corpus). Source: Author. 


49 is mainly about allemansrátten: the main keyword given in topic 49 is ‘joka- 
mies (everyman). This topic includes terms about political debate (political 
parties) on nature, nature use and access rights. Topic 76, on the other hand, 
does not have 'jokamiehenoikeus' (allemansrätten) as the main keyword, but 
the first words of the topic are ‘talk, damage; ‘needs to’ and ‘pay, and then 
‘jokamiehenoikeus’. Further down the list of topic terms we find ‘damage’ ‘prob- 
lem, ‘forest fire, “Finnish, ‘companies’ and ‘integration. No names of political 
groups are listed in topic 49, and its terms are more related to necessity and 
change, rather than political negotiation or argumentation. 

It seems, then, that allemansrätten expands beyond its ‘traditional’ range of 
reference in the period. In the 1970s, allemansrátten appears as part of political 
debate in which the concept was contested, but the contestation limited itself 
to the question of access rights to nature. The discussion focused on alleman- 
srätten itself and its roots. Based on a close reading of the corpus, we can high- 
light as an example the comment put by the agrarian party MP Mikko Kaarna 
in 1973: ‘whatever is meant by this everyman’s right, it seems that everybody 
aims to interpret it in his own way, some of us in a very broad sense.“ After 
this, Kaarna referred to the laws which already regulated outdoor access. In the 
1990s, however, allemansrätten conveyed meanings in the political debate that 
were matters external to access to nature, and rather related to international 
relations and questions of national identity. For instance, in the debate on the 
law on natural protection, agrarian party MP Markku Koski emphasised ‘how 
the right, that was very broad for the Finns, should be valued, namely alleman- 
srütten, which does not exist in other European countries to the same extent as 


in Finland. 
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Conclusion 


This digital history of allemansrätten in the recently digitised documents of the 
Finnish Parliament has allowed light to be shed on the trajectory of the Finnish 
allemansrätten. The methods used in this chapter (keywords search and col- 
location analysis) have been useful for identifying key debates and studying 
long-term changes in those debates related to allemansrátten. The chapter has 
shown how the vocabulary of ‘everyman’ was already being used extensively 
before the term allemansrätten came into use. The introduction of fishing rights 
in the early 1940s presents an important turning point, and depicts key ten- 
sions in access rights to nature. Universality in the fishing rights, which is at the 
core of today's allemansrátten, was criticised with references to the dishonest 
‘vagabond fishermen and damages done to the honest locals.'* In fact, even 
though voices were heard for making fishing permanently open for every- 
one, 'everymans fishing’ was soon limited to allow fishing in the areas where 
the 'everymen' resided. Importantly, economic reasons, including economic 
distress, were behind the decision to open the fishing rights. In a similar way, 
the discussions on wild berry picking in the late 19th century regarded the pos- 
sibilities for the poor of the rural areas." Only when strangers outside the local 
community arrived as ‘everymen’ to the woods and lakes did it become crucial 
to define the limits of access rights: the broadening (and creation) of the right 
was done for economic reasons, not based on a cultural tradition. 

The Finnish vocabulary of ‘allemansrätten’ slowly began to point to universal 
access rights to nature after the 1950s. In fact, at some point in the 1950s, two 
conceptual traditions became united. The modern institution of allemansrätten 
in Finland was modelled after the Swedish example, yet, the Swedish term was 
not translated into Finnish, but the Finnish expression of common man's and 
‘normal access to nature used for wartime fishing rights was taken into use. In 
the 1970s, then, allemansrätten was an established and commonly used concept 
in Finland. It appeared in the political debates of the decade, but mainly inter- 
nally, as something at the centre of attention itself: a ‘national heritage that was 
discussed and defined by the different national political groups. In the 1990s, 
the uses of allemansrátten had expanded to various legislative debates. With the 
normalisation of the concept, allemansrätten could be used externally, to con- 
vey meanings related to national values outside the sphere of access rights to 
nature. In the context of European integration in the early 1990s, it was used 
to defend a particular Nordic way of life in contrast to European practices of 
ownership. It is this kind of national symbol, how it is understood (or felt) 
by the public, which is presented to foreign visitors and used in the branding 
of the Nordic countries today.“ 

In general, the digitised parliamentary data opens up a new research horizon 
on the public matters of the 20th century. The data complements the digitised 
national newspapers, which are available comprehensibly for the first three dec- 
ades of the century only. Yet, the Finnish case presents shortcomings that can 
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render digital historical research cumbersome and uninviting: the parliamen- 
tary data is not easily accessible through the current web user interface of the 
Finnish Parliament. * By their nature, parliamentary documents are structured 
according to time, theme and speaker, and basic search tools and filters that 
enable the use of these features would satisfy the needs of most historians.” In 
this chapter, I did not use the web interface, but applied textual analysis meth- 
ods on the parliamentary data. These methods are applicable by historians with 
basic computational skills; however, a significant amount of data work is neces- 
sary when using such non-structured and partly weak quality data. It seems to 
me, therefore, that digital historians should pay extra attention to the workload 
and the trade-off related to the new digital sources: how much can be done 'eas- 
ily with the existing resources by the historian, and what data and development 
work can and should be left for broader cooperative projects? Moreover, this 
also implies that historians should actively partake in the digitisation processes 
of the key historical sources and the development of the related user interfaces. 
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1935 (Minutes 1935 IT). Helsinki: Valtioneuvoston kirjapaino, 1936, p. 1623. 

Brofeldt 1943. 

“.. elintarviketilanne ei vielä ole korjautunut sellaiseksi, että voisimme 

kieltää jokamieskalastajilta mahdollisuuden omalla työllään ansaita 

itselleen vähäisen särpimen lisän? Minutes 1945 II, p. 2187. 

‘Minun nähdäkseni kalastusoikeus pitäisi saada vakiinnutetuksi, ei 

ainoastaan poikkeukseksi, vaan yleiseksi laiksi, että kaikilla kansalaisilla 

olisi kalastusoikeus? Minutes 1945 II, p. 2189. 

3 “Tosin on vanhastaan katsottu, että maantavan mukaan on olemassa ns. 

jokamiehen oikeus kulkea toisen maalla ilman omistajan lupaa? Docu- 

ments of the Parliamentary Session of 1964 (Documents 1964 V), Ouestion 

no. 67, p. 2. 

Legal scholar V. K. Noponen was already using variants of allemansrätten in 

Finnish in his work on public and private roads in the 1940 and 1950s. Nop- 

onen also refers to concepts developed by the legal scholar S. Ljungman, 

who was among the first to present this 'newly-found catchphrase with legal 

value, allemansrätten. La Mela 2016: 216. 

? Laki yksityisistä teistä. Helsingin Sanomat, no. 264 (30 September 
1962), p. 31. 

?* See, e.g., Sandell & Svenning 2011. 

? The AntConc corpus analysis tool is used for concordance searches and 

file views. The concordance search was used to detect keywords in the data 

and study their uses on sentence level. See http://www.laurenceanthony.net 

/software/antconc/. 

The selected law projects from the 1990s regard: state outdoor recrea- 

tion area in Teijo, natural protection act, remuneration of environmental 

damage and the fishing law. 
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? A range of 10 words before and after the term was used to study collocation. 
^" Blei 2012. 

“ See, e.g., Weingart & Meeks 2012; Wehrheim 2018. 

% MALLET toolkit, http://mallet.cs.umass.edu/. 

? A different number of topics were tested manually. The number of topics did 
not affect the clear distinction between topics presented in Figure 11.3. A 
larger number of topics was preferred for producing more nuanced topics. 
"Mitä tällä jokamiehen oikeudella tarkoitettaneenkaan, niin näyttää siltä, 
että jokainen pyrkii sitä tulkitsemaan omalla tavallaan, monet hyvinkin 
laajasti? Minutes 1973 II, p. 1454. 

"Mielestäni Suomessa nykyisin on pidettävä arvossa sitä oikeutta, mikä 
suomalaisilla on hyvin laajasti, eli jokamiehenoikeutta, jota ei muissa 
eurooppalaisissa valtioissa siinä mittakaavassa ole kuin Suomessa? Minutes 
1996 II, p. 1700. 

Minister of Agriculture Eemil Luukka on the law proposal on temporary 
fishing rights: <.. jokamieskalastajat tästä lähtien saavat harjoittaa pyyn- 
tiään vain vakinaisen tai tilapäisen asuinpaikkansa lähivesistössä. Tämän 
rajoituksen kautta on tahdottu estää sellainen kulkurikalastajien toiminta, 
joka juuri on osoittautunut kaikkein haitallisimmaksi paikallisten asuk- 
kaiden kalanpyynnille, olkootpa he sitten kalastusoikeutta omaavia tai sitä 
vailla olevia? Minutes 1945 II, p. 2187. 

La Mela 2016. 

‘8 Mission for Finland 2010; Tuulentie & Rantala 2013. 

® Happily, the situation is improving. The research consortium “Semantic 
Parliament” which aims to produce a linked open data and research infra- 
structure on Finnish parliamentary data, started its work in January 2020. 
See https://seco.cs.aalto.fi/projects/semparl/en/. 

The digitised Canadian parliamentary debates webpage provides a very 
balanced and user-friendly interface for searching and browsing the parlia- 
mentary debates. See http://www.lipad.ca/. See also Beelen et al. 2017. 
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CHAPTER 12 


Evolving Conceptualisations 
of Internationalism in the UK Parliament 


Collocation Analyses from the League to Brexit 


Pasi Ihalainen and Aleksi Sahala 


Introduction! 


While the history of the diplomatic events and institutions of 20th-century 
international politics has been comprehensively explored,? macro-historical 
and long-term computer-assisted analyses of conceptualisations of the ‘interna- 
tional’ have not yet been attempted. With the increasing availability of digitised 
parliamentary debates, such an analysis of the everyday language of politics 
has become possible. In the conceptual history of internationalism, focus on 
Parliament is particularly pertinent in the British case, as the country has been 
one of the most active agents in the field of international cooperation while 
regarding Parliament as the forum of ideological debate. Parliament has had a 
say in foreign policy, too, regarding membership in international organisations. 

After having previously analysed parliamentary debates with more conven- 
tional close-reading methods of the history of political discourse,’ we turn here 
to text analysis programmes to explore their benefits for conceptual history. 


How to cite this book chapter: 

Ihalainen, P, & Sahala, A. (2020). Evolving conceptualisations of internationalism in 
the UK Parliament: Collocation analyses from the league to Brexit. In M. Fridlund, 
M. Oiva, & P. Paju (Eds.), Digital histories: Emergent approaches within the new 
digital history (pp. 199-219). Helsinki: Helsinki University Press. https://doi 
.org/10.33134/HUP-5-12 
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Our goal is to reconstruct meanings assigned to international issues in the UK 
Parliament in the long 20th century: How has the 'international' been experi- 
enced, understood, conceptualised, constructed, debated and redefined? How 
and why has the ‘international’ been given meaning and implicitly defined 
through its use in a variety of (ideologically motivated) political arguments, 
particularly in connection with membership in international organisations? To 
what extent and when, how, why and with what consequences has this attribute 
turned into an 'ism'? 

We argue that the distant reading of extensive series of digitised UK parlia- 
mentary debates by the means of a collocation analysis helps to extend and 
deepen conceptual analysis so that previously unnoticed ways to discuss inter- 
national cooperation can be discovered and the close reading of sources is more 
effectively focused. We supplement the analysis of parliamentary discourse at 
macro level with collocation analyses of particular debates and contextualised 
conceptual analyses in concrete speaking situations. The latter correspond with 
the criteria of historical research for understanding meaning created in specific 
contexts and provide checks to premature conclusions drawn on the basis of 
computer-assisted distant reading. In this exploration, conclusions of the dis- 
tant reading remain suggestive so that problems rising from decontextualised 
interpretations can be pointed out. Thus, our investigation provides an example 
of interaction between text analysis programmes and an analytical mind famil- 
iar with the genre and discourses of the primary sources. 

While doubts about the application of collocation analyses to intellectual his- 
tory have by no means been overcome,’ in corpus linguistics they have been 
used productively." For sociolinguists, collocation is ‘an accepted, linguistically 
meaningful measurement” referring to ‘the co-occurrence of two words within 
a pre-specified span, when the frequency of the co-occurrence is above chance, 
taking into account the frequencies of the "node" (the word in focus), its col- 
locates, and the collocation itself?” Applications thus far include a diachronic 
analysis of UK parliamentary speaking on Ireland,’ an analysis of a parliamen- 
tary debate on the climate change’ and an analysis of adjective collocates quali- 
fying capitalism. Foxlee has aimed at combining more semantically and more 
pragmatically oriented versions of conceptual history with computer-assisted 
text analysis." Guldi has demonstrated how word counts and text mining pro- 
duce indices of historical change." Lahdesmaki and Wagenaar have used collo- 
cation analysis to explore discourses of diversity within the Council of Europe 
by grouping the key terms in semantic fields (collocation networks) and meas- 
uring the frequencies of those fields in order to reveal how concepts were pro- 
duced as policy.” The current authors share such an understanding of politics 
as primarily discursive and of the need to focus on conceptual innovations and 
active uses of language aimed at affecting policies." 

The Hansard Corpus (https://www.english-corpora.org/hansard/) contains 
nearly every speech given in the UK Parliament between 1803 and 2005, 
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1.6 billion words in total, and allows researchers to search on parliamentary 
debates, including collocation searches. While the scanned records have been 
proofread and optical character recognition (OCR) is not an issue, something 
like 5% of the debates have not been included in the Historic Hansard database 
(https://api.parliament.uk/historic-hansard/index.html) from which the data 
originates, which causes some uncertainty with search results. We started our 
distant reading with a collocation analysis of the noun collocates of 'interna- 
tionalism based on the tool of the Hansard Corpus, quoting selected examples 
from Historic Hansard to give more concrete content to the discernible trends. 
We then proceeded to collocation analyses of the entire vocabulary of ‘inter- 
national’ in a selection of Commons (HC) and Lords (HL) debates concerning 
British membership in international organisations. 

For the collocation analysis, we used a measure called PMI?,'* which is a less 
low-frequency sensitive improvement of the Church and Hanks' word associa- 
tion measure built around the idea of Pointwise Mutual Information (PMI). 
The core idea of the PMI-based measures is to divide the corpus into forward- 
looking or bi-directional windows of fixed size, which define the maximum 
distance between the keywords and their possible collocates. The keywords are 
paired with each word that can be found within the defined window size, and 
the actual joint probability of each pair is compared to the expected probability 
of those words co-occurring independently. The maximum score of 0 indicates 
that the words are only found together, and the minimum of - that the words 
never co-occur within the given window size. For calculating the scores, we 
used a Python script called pmizer, which is an open source script for calculat- 
ing different PMI-based association measures from tokenised text.'* We did not 
lemmatise our data, as we wanted to preserve singular and plural forms sepa- 
rately. To avoid our data being overcrowded with conjunctions, prepositions 
and pronouns, we filtered most of these out by using a simple stop-word list. 

For the analysis of membership debates, there was no need to limit search 
terms as single debates varying from one to a few days were in question, and 
hence all references to the ‘international’ could be considered. A broad col- 
location window of 10 words both ways was used to discover every politically 
significant noun associated with the ‘international’ The scores calculated with 
PMP are reported below in the form '(number of collocates within the span of 
ten words both ways/score/distance): 

Politically interesting collocates picked from result lists that ranked the close- 
ness ofthe collocations on the basis oftheir score were grouped into collocation 
networks and their relative importance in both Houses discussed. While col- 
locations are usually considered statistically significant when they appear in the 
corpus at least twice, individual combinations of words also deserve attention 
as politically potentially meaningful innovative speech acts. The next analysis is 
primarily based on distant reading, though some general context is introduced 
to support interpretation. A close reading of some findings will follow. 
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AnternationalisnY in the UK Parliament, 1803-2005 


As the Hansard Corpus is so extensive and as the use of 'international' is often 
technical rather than ideological (referring to aviation, for instance), our dis- 
tant reading focused on collocations of 'internationalism: The total number 
of co-occurrences of ‘internationalism’ in the Hansard Corpus with a nine- 
word collocation window is 1,542, with an emphasis on the 20th century and 
especially in the interwar period and the 1970s and 1960s. This leads to a man- 
ageable amount of results, even if ones that focus on the 'extreme forms of 
international thinking that are expressed as an ‘ism’ word. 

The noun collocates of ‘internationalism that were considered politically 
meaningful were divided into 13 loose semantic fields (groups of related 
terms, topical sub-categories or collocation networks), namely nationalism, 
party, socialism/labour, spirit, peace, democracy, imperialism, cosmopolitan- 
ism, globalism, collaboration, institutions, supra-nationalism and capitalism. 
The diachronic frequencies of the six most important of these semantic fields 
are visualised in Figure 12.1. The grouping of the terms was intuitive, building 
on previous empirical analyses of discourses on internationalism. Bringing in 
the historians subjective mind in this way helped in discerning relevant topics 
among diversified discourses. 

The diachronic frequencies point at the centrality of discourses on patriot- 
ism and nationalism for conceptualisations of internationalism. As Clavin and 


Table 12.1: The number of collocations of ‘internationalism’ in UK parliamen- 
tary debates according to the Hansard Corpus. 


2000-2005 


Source: Author. 
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Figure 12.1: Politically significant collocate groups of ‘internationalism in the 
Hansard Corpus: nationalism, party, socialism (including the vocabulary of 
labour), spirit, imperialism, and cosmopolitanism. The visualisations were 
made by Kimmo Elo. Source: Author. 
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Sluga have argued, for much of the 20th century, internationalism was unthink- 
able without nationalism; nationalism was the basic premise, not a mere 
counter-concept of internationalism.” The vocabulary of patriotism contrasted 
with socialist internationalism in the British parliament already in 1882 as 
P. J. Smyth (Home Rule Party) criticised Irish agitation of ‘the substitution of a 
vague, sickly, and godless internationalism for the manly patriotism' (HC Deb. 
9 March 1882 vol. 267 c524; the references from now on are to volume and 
column numbers (c) in accordance with the conventions of British parliamen- 
tary debates). Patriotism or nationalism and internationalism (59 collocations) 
became increasingly associated after the First World War as the promoters of 
the League of Nations tried to reconcile these ways of thinking. Yet, J. D. Rees 
(Conservatives), a former colonial administrator, pointed at tensions between 
patriotism or nationalism and (socialist) internationalism, concluding that ‘[i] 
nternationalism means the negation of patriotism and the abnegation of eve- 
rything of which we should be proud. Instead of extending internationalism 
I long myself to see it abolished completely off the face of the earth’ (HC Deb. 
1 November 1920 vol. 134 c106). Internationalism could only be based on 
nationalism, as Goronwy Owen (Liberals) put it: ‘I have no sympathy at all with 
the people who preach internationalism as such ... The basis of a proper inter- 
nationalism is a good nationalism ..? (HC Deb. 27 April 1928 vol. 216 c1285). 
Morgan Jones (Labour), a pacifist, agreed in the early 1930s as every Euro- 
pean state was developing ‘not towards a growing internationalism but towards 
essential nationalism’ (HC Deb. 10 May 1932 vol. 265 c1837). Doubts about 
internationalism continued after the Second World War, especially among 
non-socialists: Ralph Rayner (Conservatives) was ironical when pointing out 
that ‘Russia is still internationalist in so far as internationalism will serve her 
nationalism' (HC Deb. 20 February 1946 vol. 419 c1192). 

Only from the mid-1950s can we find Conservatives conceptualising inter- 
nationalism in more positive terms. Peter Smithers, a British delegate for the 
Council of Europe, believed that '[a] nationalist war today is a physical impos- 
sibility; and economically and socially the temptations of the benefits of inter- 
nationalism are so great that nationalism in itself is no longer a very attractive 
proposition (HC Deb. 27 July 1955 vol. 544 c1288). More radical challenging of 
nationalism dated from the time of the EEC membership as the young David 
Owen (Labour) declared: “When I talk about European unity, I am talking in 
part about our concept of nationalism and internationalism. I find that one of 
the most dangerous facets of modern life and, indeed, of our history over the 
last 50 years is the scar of nationalism. I believe in internationalism as an article 
of faith' (HC Deb. 26 October 1971 vol. 823 c1634). By 1981, even the former 
Conservative MEP Hugh Dykes argued that ‘Britain has always been interna- 
tionalist in its nature. We changed the orientation of our internationalism by 
entering the Community in 1973. I wish that new, modern internationalism 
and Europeanism to continue, for the benefit of future generations' (HC Deb. 
8 April 1981 vol. 2 c1000). Such positive associations between integration and 
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internationalism, though interesting with hindsight, were not mainstream in 
the 1970s and 1980s either. 

Conservative and liberal suspicions about internationalism had traditionally 
risen from its associations with socialism, as is revealed by 52 collocations of 
party vocabulary and internationalism, especially in the 1910s and 1920s, but still 
in the 1970s and 1980s as well. Party perspectives include associations between 
internationalism and socialism/socialist(s) (20), workers (10), labour (7), 
revolution (1), Marxism (1), bolshevism (1) and communism/communist(s) (2). 
Socialist or labour internationalism had emerged in the mid-19th century with 
the First International,? came up in the British parliament in the 1870s and 
was welcomed by 1911 as Josiah Wedgwood (Liberals) observed how '[i]nter- 
nationalism is spreading rapidly, not only the internationalism of capital but 
the internationalism of labour (HC Deb. 14 December 1911 vol. 32 c2625). The 
founding of the International Labour Organisation (ILO) in 1919 activated this 
discourse, although typically the political rivals attacked the internationalist 
background of the Labour Party. By 1950, Joseph Kenworthy, a Labour peer, 
nevertheless declared that ‘the future of mankind lies in internationalism: That 
word "internationalism" is a word we do not hear nearly often enough to-day: 
Real internationalism properly applied would have avoided the terrible catas- 
trophes of the world wars of this century and have raised the standard of life of 
the whole of humanity' (HL Deb. 28 June 1950 vol. 167 c1186). Philip Russell 
Rea, a Liberal peer, also came up with ideas about ‘more internationalism, some 
relinquishment of national sovereignty’ as ‘necessary in the modern world’ (HL 
Deb. 2 November 1960 vol. 226 c52). Ronald Leighton, too, believed in inter- 
nationalism, but added in line with the interwar prioritisation of nation states: 
"Ihe word "inter" means between: Instead of supra-nationalism, I want to see 
a group of independent, self-governing countries co-operating together' (HC 
Deb. 21 May 1984 vol. 60 cc730-731). 

The spirit of internationalism (the fourth most common collocate of interna- 
tionalism) had been discussed before the First World War as Philip Snowden 
(Labour), an anti-capitalist trade unionist, declared that ‘we believe in the spread 
of a spirit of internationalism and urged Britain to lead ‘a great international 
league of peace’ (HC Deb. 15 March 1910 vol. 15 c308). The vocabulary of this 
discourse included collocations with spirit (24), principle(s) (22), idea(s) (17), 
sense (10), ideal(ism) (10), word(s) (13), belief/believer (12), concept (11) and 
values (7), as well as thought, thinking, theories, enthusiasm, vocabulary 
and term. There would seem to have been a slight rise in the ‘idea’ of interna- 
tionalism between the 1920s and 1940s and again in the 1970s and 1980s, as 
in the general intensity of internationalism discourse. The ‘spirit’ of interna- 
tionalism peaked from the 1930s to the 1950s, ‘principles’ peaked in the 1950s. 
Critique against the ‘idealism of internationalism appeared between the 1920s 
and the 1950s and again in the 1970s, in the same period as ‘values’ were dis- 
cussed. ‘Sense’ and internationalism were co-textualised a few times from the 
1920s to 1940s and again since the 1970s, but with diminishing frequencies. 
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Explicit references to the spirit of internationalism were not so many as could 
be expected on the basis of Britain's role as a herald of liberal international- 
ism. In the aftermath of Hitler's accession to power, James Henderson Stewart 
(National Liberals), a supporter of Anglo- American cooperation, nevertheless 
assured that the British government had done more than any other ‘to establish 
the spirit of internationalism' (HC Deb. 27 April 1933 vol. 277 c366). After the 
Second World War, the British spirit of internationalism was emphasised every 
now and then. In 1956, Lord Rea demanded that 'international questions must 
be handled with international, and not with national, mentality’ and that ‘we 
must recognise this spirit of internationalism much more than we have done in 
the past’ by joining European organisations (HL Deb. 15 March 1956 vol. 196, 
cc461-462). Reginal Prentice (Labour) welcomed development aid as a way of 
"building up a spirit of internationalism which can be a factor towards world 
peace (HC Deb. 25 April 1961 vol. 639, c303). 

Internationalism has customarily been associated with peace. This discourse 
emerged with the League of Nations and peaked in the 1930s. Dennistoun 
Burney (Conservatives), an aviation expert, summarised the logic of interna- 
tionalism from the point of view of national, European and imperial security: 
^... if you are to have peace you must have internationalism, and if you are 
to have internationalism, you can only have it by abrogating to some extent 
the sovereign rights of each nation and at the same time restricting the free- 
dom of the elective assembly of each national Government (HC Deb. 7 March 
1929 vol. 226, c670). By the 1930s, the problem was, according to Seymour 
Cocks (Labour), that Germany 'removed pacifism and internationalism from 
her vocabulary' (HC Deb. 13 November 1933 vol. 281, c665) so that, for Ralph 
Rayner (Conservatives), it already appeared as ‘extremely dangerous to teach 
pacifism, internationalism, and the brotherhood of mar (HC Deb. 14 June 1937 
vol. 325, cc112-113). Discourse on the spirit of internationalism nevertheless 
emerged in the interwar era, peaked with the creation of the United Nations 
and became again rarer in the 1950s and 1960s. After EEC membership, the 
UK Parliament appeared quantitatively at its most 'internationalist'; thereafter, 
collocations between spirit and internationalism have declined. Democracy 
was associated with internationalism in the parliamentary context mainly in 
the 1930s and 1940s, although Francis Pym (Conservatives), a former foreign 
secretary who opposed Thatcherism, argued boldly in 1987 that '[t]he world is 
interdependent, and internationalism must be nurtured in every democracy’ 
(HC Deb. 7 April 1987 vol. 114, c196). 

International issues could be conceptualised in further alternative ways. The 
British parliamentary elite had seen the League of Nations as supportive of 
the interests of the Empire." They often discussed imperialism/imperialist(s) 
(commonwealth, empire, colonialism) and internationalism in the same con- 
text, with a break in the 1950s when decolonisation had started, reflected on the 
topic occasionally from the 1960s to the 1980s, and then dropped it from their 
vocabulary. Cosmopolitan ideals of internationalism— consisting of a variety 
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of notions ranging from brotherhood, altruism, solidarity, humanitarian issues, 
aid and assistance to friendship, neighbourliness, fellowship and reciprocity— 
were also defended between the 1920s and 1940s. After the Second World War, 
this discourse became marginal, only to peak in the 1970s and 1980s in a rising 
internationalist atmosphere supportive of development and humanitarian aid, 
and losing popularity from the 1990s onwards. This period saw the emergence 
of a normative discourse on cosmopolitan democracy in political science, but 
such theories did not find their way to Parliament. The 1990s did see the emer- 
gence of the alternative discourse on globalisation, but only in three colloca- 
tions with internationalism. 

A discourse on collaboration (1) or cooperation (1) as internationalism 
surfaced by the end of the century, but as the low frequencies show, remained 
surprisingly marginal. Conventions on human rights, including that of the 
Council of Europe, appeared to Oliver McGregor, an economic historian, as 
‘the most remarkable features of recent history, a triumph for reason, co- 
operation and internationalism' over nationalism (HL Deb. 16 December 1987 
vol. 491, c729). EEC membership supported discourse on systems, organisa- 
tions, institutions and associations on the one hand and internationalism on 
the other. Yet, such debate withered away by the early 2000s, which may be 
reflective of the lack of commitment to the institutions of the community. The 
membership gave rise to entirely new debates on the relationship between inte- 
gration (3), union (6), community/ies (11), Europeanism (3), supra-national- 
ism (4) and internationalism as well, but not to any great extent. In the mean- 
time, the membership does not seem to have made such a great difference in 
associations between internationalism and markets (including trade(s)/trading, 
capital/capitalism/capitalist(s), economy, finance, growth and competition), a 
discourse that had existed before the First World War and been on a higher 
level in the interwar era. The global free trade visions of British politicians do 
not seem to have changed much with the post-Second World War economic 
integration: the EEC, too, was mainly conceptualised as a question of markets. 


Debates on Membership in International Organisations 


Next, we shall complement the above distant reading of trends in discourse 
on internationalism with analyses of collocates of the ‘international’ in entire 
parliamentary debates that concerned the British membership in international 
organisations—as key moments of discourse on the ‘international. The colloca- 
tion analyses enabled a type of ‘topic modelling’ of the contents of the debates 
so that the results were not determined by previously selected search terms 
such as 'internationalism only. 

The selected membership debates concerned the League of Nations (LoN, 
in the Commons on 21 July 1919, in the Lords on 24 July 1919), the United 
Nations (UN, in both Houses on 22 August 1945), the Council of Europe (CoE, 
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debated only in the Commons on 13 November 1950), the European Economic 
Community (EEC, several days in February 1972 in the Commons and in July 
1972 in the Lords) and the European Union (EU, a couple of plenaries in Sep- 
tember 2017 in the Commons and in January 2018 in the Lords). A further 
possibility might have been membership in NATO, but the defence alliance dif- 
fered in its character from the other more general forms of international coop- 
eration. It is also debatable whether these are the most representative occasions 
and whether the EU should be seen as a mere ‘international organisation’ or 
rather as a project of transnational integration. Debates on the EEC/EU in par- 
ticular had several stages, and in principle all of these could have been analysed, 
but for the sake of consistency only the second readings of the related bills were 
considered. The second reading is typically the stage of deliberative decision- 
making when most extensive ideological contributions to the debate are made 
and competing arguments presented, reflecting much of what had come up in 
other parliamentary discussions and the public debate. 

The Commons debates on the LoN membership on 21 July 1919 was 
a key moment in the history of British internationalism.” It took place in 
theaftermath ofthe signing of the treaty of Versailles that not only concluded the 
First World War with tough peace terms on Germany, but also introduced 
the League Covenant. The collocation analysis suggests that international- 
ism surrounding the League was conceptualised by the MPs to a great extent 
through the general concept of ‘international law; as could be expected on the 
basis of the British role in drafting the Covenant and the inclusion of the Inter- 
national Court in it. The ‘international’ in the context of the League was about 
‘court’ (4/-1184/2.75), ‘legislation’ (2/-1238/2.0), ‘regulation’ (1/-1238/1.0), 
‘justice’ (4/-1271/2.75) and ‘law’ (1/-1571/1.0). Yet, for a trading nation, 
the ‘international also stood for finances, as reflected by nine close collocations 
of ‘international’ and ‘finance’ (9/-1029/1.0) and two more with ‘financiers’ 
(1/-1338/1.0) and ‘financial’ (1/-1655/1.0). Labour’ had numerous close col- 
locates with ‘international (10/-1220/2.2) due to the connected founding of 
the ILO, aimed at appeasing revisionist Western socialists under the alternative 
of the Communist International.” As an entirely new international organisa- 
tion was being constructed, its institutions were discussed with terms such as 
‘bureau’ (1/-1238/2.0) and ‘machinery’ (3/-1312/1.0). Discourses on ‘experi- 
ment’ (2/-1355/5.0), ‘opportunity’ (3/-1367/5.3) or ‘cooperation (1/-1397/3.0) 
and ‘international’ surfaced, but only rarely. Out of these findings, ‘opportunity’ 
will be analysed in more detail below. 

The Lords paid plenty of attention to the moral aspects of the League, associ- 
ating ‘morality’ (6/-800/1.0) tightly with ‘international’ and connecting ‘morals’ 
(1/-1059/3.0) and ‘sanctity’ (1/-1059/4.0) with it as well. The League was about 
‘treaties’ (2/-1059/6.0), ‘jurisprudence’ (1/-1159/1.0), ‘court’ (2/-1178/1.0), 
‘sanction (1/-1217/1.0), ‘justice’ (2/-1259/1.0), ‘code’ (1/-1259/1.0), ‘rules’ 
(1/-1259/6.0), law’ (1/-1391/1.0) and ‘treaty’ (1/-1611/5.0), the total number of 
legal collocates rising to 13. Reflective of the more value- than interest-directed 
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discourse (different from the down-to-earth approach of the Commons) is talk 
about ‘international spirit’ (2/-1117/1.5), which was reinforced by synonymous 
collocates such as ‘friendship’ (1/-1059/7.0) and ‘reconciliation’ (1/-1159/4.0). 
A major difference was the lack of debate on ‘finances’ and ‘labour, which 
shows how economic and social issues were left for the lower house to deal 
with, appearing as less relevant in the social context of the peers. 

The League was generally regarded as a drastic failure after the Second 
World War, and the British government then agreed with the United States and 
the Soviet Union on the founding of a new international organisation.” The 
Commons debated the UN membership in August 1945, in the aftermath of 
a victory over Nazi Germany and in the shadow of the first military use of the 
atomic bomb in Japan. Despite dissatisfaction with the League, British under- 
standings of the ‘international’ had not changed much since 1919: the UN was 
likewise conceptualised through law, labour and institution. Next-door collo- 
cations of ‘international’ and ‘court’ (5/-1076/1.0), ‘justice’ (6/-1097/2.7), ‘law’ 
(4/-1228/1.0) and ‘lawyers’ (1/-1382/1.0) dominated, and collocations with 
‘conventions’ (2/-1182/2.0) can be added to this discourse on international law. 
Associations with ‘labour’ (9/-1080/1.7) and ‘workers’ (1/-1582/7.0) continued 
to feature, which shows not only that the interests of the working class were 
central to the current Labour government, but also highlights a reaction to 
the strengthened international status of the Soviet Union and the connected 
need to appease the working classes of the West. Discourse on the institution 
focused in 1945 distinctly on ‘control (10/-1126/2.6), ‘security’ (5/-1476/3.2), 
‘machinery’ (2/-1490/1.0) and ‘peace’ (4/-1544/1.0). ‘Economic’ (3/-1500/5.0) 
had rather loose connections with ‘international’ in comparison with the post- 
First World War situation. Collocates deserving further exploration include 
‘collective’ and ‘security’. 

The Lords viewed the UN much like the Commons, emphasising law and 
justice on the one hand and the functioning of the institution aimed at col- 
lective security on the other. The peers associated ‘international’ and ‘justice’ 
(3/-1156/1.0), ‘treaties’ (1/-1292/5.0) or law’ (1/-1509/1.0) and produced col- 
locations of ‘international’ with ‘supervision’ (2/-1192/-1292/1.0), ‘guards’ 
(1/-1192/3.0), ‘operation (2/-1292/3.0), ‘security’ (5/-1341/3.6), ‘peace’ (4/ 
-1386/1.0) and ‘machinery’ (1/-1573/1.0). Noteworthy are close associations 
between ‘international’ and ‘collaboration’ (2/-1151/1.0). Comments on ‘pat- 
riotism (1/-1292/9.0) and the ‘commonwealth (1/-1473/3.0) were also made. 

Also at the formation of the CoE in 1950, the Commons drew predominantly 
from conceptualisations of the international law. "Tribunal (1/-1170/1.0) and 
‘court’ (4/-1202/1.0) had several close collocations with ‘international, and 
collocations with ‘justice’ (3/-1085/2.3) and ‘jurists’ (1/-1170/9.0) appeared. 
Associations between ‘labour’ (4/-1228/3.25) or ‘workers (1/-1370/2.0) 
and ‘international’ remained part of the discourse. Loose associations between 
‘international and ‘continental’ (1/-1540/4.0), ‘Brussels’ (1/-1540/7.0) and ‘fed- 
eral’ (1/-1634/4.0) were emerging, which may be indicative of a tendency to 
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locate the ‘international’ out there in Europe. Economy or trade (1/-1578/9.0) 
had a marginal role in the CoE debates, nor do democracy or human rights 
feature,” which is surprising given the later role of the organisation, and associ- 
ations between ‘peace’ (1/-1609/6.0) and ‘international’ were weak as well. The 
CoE appeared as a further body applying international law. Yet, it might have 
powerful tools of ‘international pressure’ (2/-1070/1.0, the strongest discovered 
association) in its possession, to which we shall return below. 

The EEC, by contrast, was conceptualised in 1972 much less through law than 
economy. For the Commons, the EEC was about markets, trade, business and 
companies—not about law—which suggests a lack of dedication to common 
legislation as an aspect of the European integration. Associations between 
‘international’ and ‘companies’ were exceptionally strong (7/-1022/1.0), but 
the ‘international’ was also associated with ‘monetary’ issues (6/-1171/2.0), 
‘trade (5/-1345/2.6), fund’ (3/-1362/2.0), ‘trading’ (2/-1442/1.0), ‘firm 
(1/-1515/9.0), ‘business’ (1/-1553/1.0) and ‘growth’ (1/-1659/5.0). An associa- 
tion between ‘international’ and ‘continent’ (1/-1529/4.0) can be found, but 
interesting is the remaining considerable semantic distance between 'inter- 
national and ‘Europe (2/-1781/6.5) or ‘European’ (1/-1971/5.0), ‘community’ 
(2/-1903/5.0) or even ‘British’ (1/-1933/3.0). The integration was not that much 
about partnership, with ‘partners’ only passingly associated with ‘international’ 
(1/-1383/9.0). Some associations with ‘scientific’ (2/-1215/6.0) aspects of inte- 
gration appeared, while the defence aspect of the EEC was mentioned only 
in passing (1/-1622/2.0). At first sight, some concern on being located in an 
‘international periphery’ (1/-1283/1.0) would seem to have been expressed, but 
close reading will lead to opposite conclusions. 

The Lords talked much less about law than in connection with previ- 
ous memberships. While ‘international law (1/-1732/1.0) was mentioned, 
‘rights’ (1/-1505/4.0) and ‘rules’ (1/-1543/8.0) had few and relatively weak 
associations with ‘international. The economic aspect was less distinct than 
in the Commons—associations with companies (1/-1432/7.0), ‘capital’ 
(1/-1443/2.0), ‘monetary’ (1/-1512/1.0) and ‘trade’ (1/-1704/1.0) appearing, 
but those with ‘economic’ (1/-1727/5.0) and ‘market’ (1/-1799/9.0) being much 
weaker. Associations with ‘mobile (1/-1073/1.0), ‘cohesion (1/-1173/3.0) and 
‘standards’ (1/-1454/1.0) reflect an understanding of the ideas of the economic 
community, although a need for ‘protecting’ (1/-1073/2.0) would suggest the 
opposite view. ‘International partnership’ (1/-1419/1.0) as an expression made 
an appearance. The semantic distance between both ‘British’ (1/-1779/6.0) or 
‘community’ (1/-1926/2.0) and ‘international’ is observable, just as in the Com- 
mons. Worth close reading is an occasional association between ‘European’ and 
‘internationalist’ (1/-1638/3.0), for instance. 

Debates since the Brexit referendum of June 2016 have been complex and are 
far from completed at the time of writing (August 2018/2019), which means 
that the following remarks necessarily remain provisional. The early Commons 
debates reflected a considerable concern about ‘development’ (4/-990/2.25) in 
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the field of international questions and addressed international 'obligations' 
(4/-1042/1.5) to an exceptional degree. The latter way of speaking can be seen 
as an aspect of the traditional legal discourse on 'treaties' (3/-1209/1.0) and 
law (3/-1657/1.0), which seems stronger at the time of a prospective exit from 
the EU than during entrance negotiations. Yet, the Commons continued to 
understand the EU overwhelmingly through ‘trade’ (9/-1144/1.7, though 
this result is overemphasised by the title "International Trade Secretary’), 
and the prospective withdrawal gave rise to questions about 'tax' (1/-1368/1.0), 
‘taxation (1/-1442/4.0) and ‘customs’ (1/-1690/8.0). “United Kingdom 
(2/-1518/5.5), ‘nation (1/-1534/5.0) and ‘UK’ (2/-1712/4.5) were viewed as 
only slightly more ‘international’ than in the 1970s, and ‘union’ (2/-1750/8.0) 
and ‘European’ (2/-1988/5.5) continued to be dissociated from ‘international’ 
The British MP refrained from conceptualising the EU which they were about 
to leave as ‘international. 

The Lords, known for their more pro-integration stands, were likewise dedi- 
cated to ‘international obligations’ (3/-1243/1.0) and ‘international treaty/ 
treaties’ (4/-1371/-1482/1.0), but also to the ‘international reputation’ of Brit- 
ain (2/-1371/1.0). Like the Commons, they felt some concern about interna- 
tional ‘development’ (2/-1440/3.0). Discourses on international addressed 
economic issues with ‘monetary’ (1/-1301/1.0), ‘trade’ (6/-1432/2.3), ‘fund’ 
(1/-1460/2.0), ‘manufacturing’ (1/-1571/1.0), ‘bank (1/-1591/9.0) and finan- 
cial’ (1/-1794/1.0), but this was by no means the dominant discourse. Legal 
discourse played a more diversified role than at the time of joining the EEC, 
seen in association with ‘divorces’ (2/-1101/1.0), ‘crime(s)’ (2/-1360/-1518/1.0 
-4.0), law’ (7/-1516/3.6), ‘rules (2/-1582/6.5), ‘court’ (2/-1644/1.0), ‘standards’ 
(1/-1718/1.0), ‘regulatory’ (1/-1760/1.0), ‘justice’ (1/-1763/3.0) and ‘rights’ (1/ 
-2060/2.0). Noteworthy is the use of the metaphor ‘divorce’ to describe Brexit, 
with an emotional connotation side by side with concrete legal discourse. An 
exceptional intervention addressing ‘internationalist heritage’ (1/-833/1.0) has 
been chosen for closer reflection below. ‘National’ (2/-1597/2.0), ‘nation’ (1/ 
-1763/6.0) or ‘UR (2/-1877/8.0) had not become any more ‘international’ than 
in the membership debates, and there was really not anything ‘European’ that 
would appear as 'international either (2/-1916/5.5). In the context of Russian 
interventions in Western elections in general and the British referendum in 
particular, ‘international’ found an association in cyberattacks' (1/-1201/3.0) as 
well. Otherwise, the British debates on Brexit show considerable trajectories in 
the prioritisation of economy and the rise of legal discourse as a consequence 
of the membership. 


Individual Speech Acts Surrounding the ‘International 


The third and final step of our analysis proceeded as close and contextualising 
reading of some discovered collocations. Potentially interesting collocates were 
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pointed at in the above debate analysis. They were now located in their textual 
context in the House of Commons Parliamentary Papers database (Hansard 
1803-2005 and, for the EU case, Hansard Online of the UK Parliament). This 
phase allowed some checking of the functioning of the collocation analysis 
programme and our preliminary conclusions. The quotations were analysed 
as individual speeches in which politicians defined the ‘international’ by the 
active use of language in political action in particular contexts. These could 
only be reconstructed on an exemplary basis in the confines of this report. 
While general trends of thought on internationalism (such as the centrality of 
law and economy) become obvious on the basis the macro-level collocation 
analysis and need no extensive discussion here, some peculiar points deserve 
attention as they demonstrate the importance of context in determining what 
exactly was done politically in Parliament, also revealing shortcomings in mere 
distant reading. 

As the Commons debated the League Covenant, two Labour MPs came up 
with ‘opportunities’ opened up by it: for international relations in the spirit of 
the optimistic expectations of British internationalists, and for social reform 
central for the Labour Party. J. R. Clynes, a leading trade unionist and the dep- 
uty chairman of the Labour Party, welcomed the League ‘[a]s an instrument 
for providing, through the medium of International Courts and international 
action, an opportunity for considering differences as they arise’ (HC Deb. 21 
July 1919 vol. 118, c961). George Barnes, a former Labour leader who repre- 
sented in 1919 the pro-coalition National Democratic and Labour Party and 
had been one of the British negotiators in Paris, encouraged social reform in 
the spirit of labour internationalism by pointing that ‘for the first time Gov- 
ernments have put a chapter of Labour into an international Treaty [ILO] and 
made labour conditions a matter of international agreement, which constituted 
an 'opportunity' for workers worldwide (HC Deb. 21 July 1919 vol. 118, c976). 
These quotes exemplify the high leftist expectations for international coopera- 
tion during the post-First World War reconstruction. 

Liberal and Conservative internationalists in the Lords were, in the name 
of the government, also predominantly optimistic. James Bryce, a respected 
constitutional lawyer, member of the International Court at The Hague and 
Liberal politician, viewed the League as based on ‘the feeling that the world has 
now become one one in a new sense never dreamed of before’ surrounding ‘the 
belief that the community of the world requires that a new spirit should pre- 
vail in international relations—a spirit which seeks to substitute friendship for 
enmity (HL Deb. 24 July 1919 vol. 118, c1019). George Curzon (Conservatives), 
a major imperialist (as former Viceroy of India) and acting Foreign Secretary, 
echoed this belief in rising internationalism, stating, ‘the international spirit, 
the kind of idea that the future unit is not to be the race, the community, the 
small group, but is to be the great world of mankind, and that in that area you 
try and induce a common feeling, you try and produce co-operation which will 
be a better solvent of international difficulties .. (HL Deb. 24 July 1919 vol. 35, 
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c1029). The League was basically welcomed by all political groups in the British 
parliament, even if questioning its effectivity was also widespread. 

Doubts about weaknesses in organisation were equally present as the UN was 
formed in 1945. Prime Minister Clement Attlee, the leader of a Labour major- 
ity government, then forcefully advocated the concept of ‘collective security’ as 
the foundation of the UN Security Council ‘where the policies of the States ... 
could be discussed and reconsidered ... especially when they showed signs of 
such divergences as to threaten the harmony of international relations. Collec- 
tive security ... is active co-operation to prevent emergencies occurring’ (HC 
Deb. 22 August 1945 vol. 413, c665). Captain David Gammans (Conservatives), 
a former diplomat, was one of several MPs to question the definition of the UN 
as a provider of ‘collective security, suggesting that such a concept should not 
be used at all (HC Deb. 22 August 1945 vol. 413, c734). The required unanim- 
ity of the permanent members, after all, would constitute a major limitation to 
the functionality of the world organisation. In the Lords, Robert Cecil, the key 
planner of the League, a leader of the League of Nations Union and a Nobel 
Peace Prize winner (1937), consistently assured that ‘every attempt ought to 
be made, and must be made, to secure peace by international collaboration 
(HL Deb. 22 August 1945 vol. 413, c133), bridging two major projects of British 
internationalism. 

The debate on the CoE in 1950 provides a good reminder of the need for 
close reading and contextualisation. While a swift reading of the collocation 
results might suggest that ‘international pressure’ by the CoE was welcome, 
a closer analysis shows that the contrary was the case. Major Harry Legge- 
Bourke (Conservatives) was opposing restrictions to national or parliamentary 
sovereignty when arguing: ‘I am in favour of the Council of Europe, but I am 
in favour of it only on one set of terms, and that is that it remains as a council 
and does not become an international pressure group. There seems to be very 
real danger of it becoming an international pressure group? What particularly 
worried Legge-Bourke was a ‘desire for the institution of a European Political 
Authority' (HC Deb. 13 November 1950 vol. 480, c1479). 

Caution with far-reaching conclusions based on collocations is needed also 
in the case of the EEC membership. The MP who referred to ‘international 
periphery’ did not imply that Britain would become somehow peripheral out- 
side the Community, but was concerned about the potential loss of sovereignty. 
Ronald King Murray (Labour), a leading Scottish lawyer, reacted to a sugges- 
tion that membership in the UN and NATO already implied a loss of national 
sovereignty and relativised the radicality of an EEC membership by pointing 
out that: 


... it was sovereignty in a peripheral sphere, the international periph- 
ery of our being which did not involve the heart of our domestic con- 
stitutional being as the Bill unquestionably does. We are surrendering 
a portion of the inner core of our sovereignty because we are dealing 
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with two aspects of the constitution, first with an economic aspect and 
secondly with one which is more properly constitutional. (HC Deb. 15 
February 1972 vol. 831, c363) 


While general international or defence cooperation did not challenge national 
sovereignty, the economic and political aspects of the EEC did—a conclusion 
that has dominated much of the British press discourse ever since. Arguments 
in favour of membership won in 1972, although hardly in the extreme form 
presented by Frank Beswick (Labour), a former voluntary in the Spanish Civil 
War and a current party whip: ‘My own approach to this Bill is that of an inter- 
nationalist. I have always been ready to surrender sovereignty in those areas 
where individual and national dignity and wellbeing are not impaired’ (HL 
Deb. 26 July 1972 vol. 333, c1368). Confrontations were even tougher in the 
Brexit debates. The supporters of ‘leave, emphasising British national identity 
as distinct from the Continent, considered the vocabulary of the ‘international’ 
useless. ‘International’ was an attribute of the ‘remain’ side and often of back- 
benchers with limited political influence. An exceptional association between 
‘international’ and ‘European’ was made by Liz Saville Roberts (Plaid Cymru), 
who emphasised the significance of the EU for British foreign relations as a 
whole: ‘Beyond the single market and customs union, there are upward of 40 
pan-European agencies that form the basis of our international relations across 
a range of policy areas (HC Deb. 7 September 2017, c422). A suggestion that 
the government was acting against national values was heard in Helen Hayes’ 
(Labour) declaration that Brexit impacted negatively ‘our British values of tol- 
erance, diversity and internationalism’ (HC Deb. 11 September 2017, c574). 
Roger Liddle, a former researcher, adviser of Prime Minister Tony Blair and 
of the President of the European Commission and a think-tank chairperson, 
contributed to an intra-Labour dispute on Brexit. He appealed both to the tra- 
jectory of British internationalism and to the tradition of labour international- 
ism to persuade his party fellows to oppose Brexit: “Europe is in a category of 
its own in terms of its impact on future generations ... I want our party to lead, 
to seize this opportunity to demonstrate that, in contrast to this wretched Gov- 
ernment, we can live up to our national responsibilities and our internationalist 
heritage’ (HL Deb. 31 January 2018, c1534). Not only the Brexiteers but also 
their opponents were on the move, fighting on definitions of internationalism 
also with history-political arguments. 


Results and the Added Value of Digital Methods 
for Conceptual History 


This computer-assisted analysis combining the collocation tool of the Han- 
sard Corpus, the collocation analysis of membership debates and a contextual 
analysis of instances of political speaking has provided us with an overview 
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of the evolving discourse on internationalism in the UK Parliament, while 
also revealing innovative speech acts of potential political significance. Both 
expected and more surprising general trends were demonstrated, specificities 
of associations of the ‘international’ in the context of decisions on member- 
ship in international organisations pointed out and some peculiar arguments 
by individual MPs reconstructed. 

The concrete findings include the dominance of discourses on international 
law in the founding of international organisations other than the EU. Parlia- 
mentary discourse, rather than contributing much to the creation of its first 
major international institution, turned more internationalist as a consequence 
of the founding of the League of Nations. The first, rather weak, wave of British 
internationalism lasted from 1919 to the founding of the UN in 1945. Interna- 
tionalism remained relatively weak during the Cold War, with the exception of 
a few internationally oriented politicians, advocates of the EEC membership 
and expressers of global solidarity in the 1970s, when the second wave of British 
internationalism peaked. Once the EEC membership had become a reality, the 
parliamentary elite lost its enthusiasm about internationalism, especially with 
reference to European cooperation, and anti-European rhetoric rose during 
Thatcher’s governments. Economic debate dominated and the legal discourse 
was set aside as Britain joined the EEC, only to be restored with Brexit when 
especially the trajectory of discourse on national sovereignty versus interna- 
tionalism resurfaced. Early 20th-century discourses on labour international- 
ism have mostly withered away with the rise of non-socialist internationalism, 
with some revival in the 1970s and 1980s and during the Brexit crisis. Several 
factors indicate that internationalism was in decline well before Brexit, starting 
in the 1980s, and that trends in public discourse had implications for followed 
policies. Our conclusions correspond with Glenda Sluga's suggestion that the 
‘global seventies’ of new international society and international public sphere 
were followed by the ‘post-international’ 1990s,” but the British turn to post- 
international discourse clearly deserves more attention. 

Computer-assisted collocation analyses can contribute to conceptual his- 
tory in at least two ways:” First, the analysis produces quantitative data on 
associations between political concepts that enable us to estimate trends in 
political attitudes the reconstruction of which with traditional methods would 
not be possible. Second, distant reading reveals original political points that 
would have gone unnoticed in close reading or in full-text keyword searches. 
Such arguments must be subjected to close reading and contextual analysis so 
that premature conclusions based on distant reading can be corrected. Revealed 
peculiarities in argumentation frequently turn out to originate from leading 
politicians attempting to influence the course of policy, which warns against 
low frequency thresholds.” All in all, the collocation analysis of discourse on 
internationalism in Parliament works well, enabling a more efficient locating of 
meaningful speech acts, although their meanings can only be properly under- 
stood with close reading and appropriate contextualisation. 
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The collocation analysis could be extended on many levels: In distant read- 
ing, it would be important to proceed beyond the ‘extreme’ concept of ‘inter- 
nationalism, with its inherited associations of socialism and pacifism, and 
to include concepts such as the world, humanity, universal, global, Empire, 
Commonwealth, cosmopolitan, supranational, multinational, transnational, 
Europeanism and trans-Atlantic, for instance. The global (imperial) dimension 
has remained central for Britons also in the days of European integration, and 
the joint natural language with the United States has supported discourses of 
isolationism against internationalism. At the level of membership debates, syn- 
onyms and counter-concepts of the ‘international’ (foreign, abroad, domestic, 
national, etc.) could be considered. At the level of individual speech acts, the 
dynamics between political parties in arguing about internationalism would 
deserve more attention. A more extensive reconstruction of the temporal con- 
texts of the arguments on the basis of digitised newspapers and other forums of 
political debate would be helpful then. Furthermore, now that the sub-themes 
of the discourse on internationalism in the UK Parliament have been identi- 
fied, the analysis of the dynamics between nationalisms and internationalisms 
in other parliaments and transnational interconnections between these debates 
could be explored. 


Notes 


! Pasi Ihalainen was alone responsible for the planning of the research set- 
ting, analysis and the written report. Digital analysis method specialist 
Aleksi Sahala created the program script that enabled the distant reading of 
particular membership debates. Kimmo Elo produced the visualisations 
of the results of digital distant reading. 

? Yearwood 2009; Laqua 2011; Mazower 2012; McCarthy 2012; Sluga 2013; 
Clavin & Sluga 2017. 

3 Thalainen & Palonen 2009; Ihalainen & Matikainen 2016; Ihalainen 2017; 
Holmila & Ihalainen 2018; Ihalainen 2018. 

^ Edelstein 2016. 

5 Baker et al. 2008. 

° Baker, Brezina & McEnery 2017: 105. 

7 Gabrielatos & Baker 2008: 11. 

* Baker, Brezina & McEnery 2017. 

? Willis 2017. 

10 Foxlee 2018: 77, 80. 

1 Guldi 2019. 

? Lahdesmaki & Wagenaar 2015: 16. 

13 Ihalainen 2006; Halonen, Ihalainen & Saarinen 2015; Ihalainen & Saarinen 
2015; Steinmetz & Freeden 2017; Ihalainen & Saarinen 2019. 

4 Daille 1994. 
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5 Church & Hanks 1990. 

16 The technical description of the analysis program has been written by 
Aleksi Sahala. See https://github.com/asahala/Collocations. 

17 Clavin & Sluga 2017: 5-6; Sluga 2013, 3, 5. 

18 Sluga 2013: 4. 

? Holmila & Ihalainen 2018. 

? [bid. 

21 [bid. 

? Ibid. 

? Cf. Häkkinen 2018. 

^ Sluga 2013: 6-7, 9. 

3 Cf. Steinmetz & Freeden 2017: 32, who are uncertain as to how to interpret 
semantic data rising from digital humanities. 

*° See also Kim 2014: 233. 
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CHAPTER 13 


Picturing the Politics of Resistance 


Using Image Metadata and Historical Network 
Analysis to Map the East German Opposition 
Movement, 1975-1990 


Melanie Conroy and Kimmo Elo 


Introduction: Networks of the East German Opposition 


This chapter shows how network graphs and analysis can be used to shed light 
on the structure and dynamics of the geospatial social networks of segments 
of the East German opposition movement between 1975 and 1990. What new 
knowledge can we uncover about a well-studied historical phenomenon if 
we combine the use of non-traditional source material, in this case metadata 
from an image database catalogue, with a non-traditional historical method- 
ology, namely, social network analysis? This chapter studies the network of 
East German dissidents as reflected in the photographic database on the East 
German Opposition, which archives photos from the 1970s until the fall of the 
Berlin Wall. 

In this chapter, we examine graphs of East German dissident networks, as 
well as sub-networks filtered by date and by place. We then discuss the general 
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principles of social network analysis and historical network analysis. Finally, 
we consider how network analysis could be further improved for historical 
purposes. The primary tool we use is Palladio, a suite of tools for the visu- 
alisation of historical networks developed in the Humanities + Design Lab at 
Stanford University. Unlike many tools for creating network graphs, Palladio 
was designed for humanists to visualise data without the need for a designer! 
Palladio can be used to visualise historical datasets and discover patterns in the 
data that researchers may choose to analyse using network analysis or other 
means, including examining the original historical sources. 


An Unconventional Source Material: 
Metadata of an Image Database 


The photographic database maintained by the Robert Havemann Society 
in Berlin as part of its archive on the East German Opposition consists of 
approximately 60,000 digitised photos with a relatively rich metadata provid- 
ing information, including the date the photo was taken, the photographer a 
descriptive title, keywords, regional/geographical tags and information about 
the persons to whom the photo is related? The sample used in this chapter 
consists of photos featuring selected prominent figures of the East German dis- 
sident scene with a connection to the city of Jena. These individuals included 
academics, artists and intellectuals of diverse socio-economic backgrounds, 
who were part of a range of movements, from youth movements to peace activ- 
ism to environmental movements, during the period from 1975 to 1990. In 
most cases, the photos were taken to document opposition action and activi- 
ties and used as illustrations in underground magazines, bulletins and leaflets. 
We should, however, keep in mind that since the German Democratic Repub- 
lic (GDR) was a dictatorship, photographing these kinds of illegal actions was 
closely bound with the risk of becoming subject to counter-measures by the 
security authorities. From this perspective, the photos also document the cour- 
age of the people involved in oppositional activities. 

In the history of the East German opposition, Jena and Berlin were the 
two most important regions when it came to the structure, means, motives 
and dynamics of the opposition groups in the GDR. In Jena, the discrep- 
ancy between democracy and dictatorship often led to open conflicts, mak- 
ing this city the primary region of political opposition in the GDR. Jena was 
also called the secret capital of the GDR opposition, reflecting the complex 
domestic conflict between the state apparatus, church and opposition in the 
GDR.’ The temporal focus of this chapter is the period between 1975 and 1990, 
a period heavily shaping the range of political action for the opposition and 
resistance groups. One key event was the Conference on Security and Coop- 
eration in Europe (CSCE), held in Helsinki in August 1975, which caused 
the East German political leadership to become increasingly concerned with 
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the destabilising impact of the CSCE.* As a consequence, the Sozialistische Ein- 
heitspartei Deutschlands (SED), the communist monopoly party of the GDR, 
took numerous repressive actions against dissidents and opposition groups, 
seeking to scatter the resistance and opposition by eliminating their leading 
personalities? To what extent can we see the repressive actions of the SED in 
the networked relations of Jena-linked figures of the opposition? 

One of the main components of the Jena-based opposition movement was 
the Jena Peace Community (Jenaer Friedensgemeinschaft), established in March 
1983 as one of the first major opposition communities outside the protective 
walls of the Evangelical church. The local Peace Community was a dissident 
platform of short duration, but of long-lasting impact. Its founders were 
disillusioned with the reluctant resistance of the Evangelical church against 
state repression and, hence, sought to establish a new, independent platform 
under the umbrella of the European Peace Movement. The community itself 
was short-lived, because by the spring of 1983, the security authorities had 
already decided to destroy the Jena Peace Community once and for all. But, 
because the security police did not achieve its main goal (a complete destruc- 
tion of Jena’s opposition), the community had a long-lasting impact, causing 
Jena to remain an unsettled city and one of the most important places for 
political opposition between 1983 and 1989.* Members included Uwe Behr, 
Manfred Hildebrandt, Mario Dietsch, Edgar Hillmann, Michael Rost and 
Frank Rub, and non-church-members Roland Jahn and Petra Falkenberg. 
As we shall see, many of the members and allies of the Jena Peace Commu- 
nity remained active in the opposition movement in Jena for years after the 
crackdown. 

The data culled from the database catalogue combine many of the elements 
of historical research which can be profitably analysed and give us an opportu- 
nity to study a network that is relatively circumscribed in both time and place. 
Further, the connections between individuals are consistent since they all rep- 
resent co-occurrences in photos. Knowing the boundaries of the network and 
having an understanding of what the underlying data represent are fundamen- 
tal to creating a data model and visualisations which contribute to a research 
problem rather than merely illustrating an archive. The individuals who appear 
most frequently in the data are known historical figures, many of whom were 
instrumental in the creation of the archive: Matthias Domaschk, Jürgen Fuchs, 
Roland Jahn, Robert Havemann, Katja Havemann, Bettina Wegner, Carlo 
Jordan, Gerd Poppe, Bärbel Bohley and Tom Sello.” 

Many of these photos are of protests and actions by the opposition; other 
photos are casual portraits and group shots not obviously related to any polit- 
ical action. The photographers are recorded (where they are known) by the 
archivists; the photos were, for the most part, taken by members of the group 
and their acquaintances. We see one example of a group photo (Figure 13.1) 
taken at Ulrike and Gerd Poppes home in Woltersdorf. In this photo, we see 
Robert Havemann, Ulrike Poppe and others who are gathered for a reading of 
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Figure 13.1: Reading at Ulrike and Gerd Poppe’s property, photo by 
Gerd Poppe, courtesy of Robert-Havemann-Gesellschaft, 27 June 1981. 
Source: Robert-Havemann-Gesellschaft/Gerd Poppe/RHG_Fo_HAB_09781. 
All rights reserved. 


the works of Gert Neumann on 27 June 1981, after the release of Havemann 
from police custody. 

In this photo, we see the conviviality of the opponents of the regime who are 
drinking wine and enjoying one another's company despite the serious circum- 
stances. This is a rather typical example from the photo collection documenting 
activities of East German opposition. At the same time, since photos like this 
successfully document social actions taken by individuals, they offer a reliable 
source to reconstruct historical social networks. 


An Unconventional Method: Why Network Analysis? 


Social network analysis has been used since at least the 1940s in the social sci- 
ences.? Network analysis has only more recently been adopted within histori- 
cal disciplines.? Networks are a powerful analytical tool for understanding the 
structure of groups, especially at scale or when there are complex interrelations 
between large numbers of individuals. While the network is a 20th-century 
concept, social relations that could be described as networks have existed in all 
historical periods and in all societies." Whether we can profitably examine one 
particular social group through the lenses of network analysis is a reflection 
of how much we know about the internal structure of that group, the research 
question being asked and the completeness of the historical documentation 
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that could be used to reconstruct the network." Historical network analysis 
(HNA) is deeply rooted in social network analysis (SNA), using the same basic 
structures and metrics. But HNA must engage with questions of historiography 
and the use of sources in a way that is different from sociological methods. In 
this chapter, we show how network graphs and analysis can be used to shed 
light on the structure and dynamics of the geospatial social networks of seg- 
ments of the East German opposition movement between 1975 and 1990. This 
chapter’s exploration of dissident networks serves as an example of how schol- 
ars can discover connections between individuals and sub-groups that might 
track the spread of dissident thought between people, as well as between geo- 
graphical regions. 

SNA allows us to study the meaning and importance of relationships in great 
detail and, thus, offers a promising tool to examine past communities in the 
aggregate: 


With SNA, we are only interested in individuals as part of a much bigger 
whole. In fact, one advantage to the technique is that SNA helps us view 
an entire community and figure out which individuals we should be 
truly interested in and which ones were perhaps less significant. When 
we study past relationships systematically as SNA allows, the method 
will prevent us from misunderstanding the function of an individual’s 
relationships or exaggerating the distinctiveness of those relations." 


In our eyes, the true power of HNA lies in its capability to untangle complex 
social interaction patterns by way of graphical visualisations, thus making 
those patterns easier to perceive and analyse. The limit of network analysis is 
in researchers reducing social relations to those patterns and potentially los- 
ing track of the ways in which the network abstracts from the source material. 
For this reason, HNA should be practised with a keen awareness of the source 
material and the cultural and historical context of the networks at hand. 


Using Network Graphs to See the Big Picture 


One of the difficulties in creating network graphs is to create legible diagrams 
which show the structure of the entire network. Many of the most famous net- 
work diagrams are both complex and vast. Such diagrams are often referred to 
derisively as ‘hairball’ graphs due to their illegibility.? As we shall see, using 
a combination of complex graphs of the whole network and smaller, more 
precise graphs of network segments is an easy way to overcome this problem. 
Figure 13.2 shows the network as a whole. The cleaned-up and corrected data- 
base contains 841 records (photographs) with 171 unique person references. 
There are a total of 1,843 co-occurrences of these individuals in photographs 
documented in the database. 
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Figure 13.2: Network graph of GDR dissidents linked to Jena in photographs, 
1975-1990. Source: Authors. 


This network graph was created in one step in Palladio and was not modified, 
so it is relatively difficult to read, but it gives us a ‘quick and rough overall view 
of the network. There are many tools available for creating network diagrams 
which can make use of colour, refine the design and change the spacing and lay- 
out. A network is a set of nodes (in this case, people) linked by edges (in this 
case, photo co-occurrences); in this graph, the nodes are labelled with the name 
of the person and sized based on the number of photos that person appears in. 

It is apparent in Figure 13.2 that the majority of people who appear in the 
photos appear together. Figure 13.2 evidences the existence of a core social 
network of densely connected people, in network terms, the giant compo- 
nent. Outside this core component there are four distinct yet very small social 
groups. These sub-networks emerge from photographs taken in apartments or 
in the outdoors, apparently just documenting daily events and activities and 
people involved in these. In fact, such photographs form a large part of the 
giant component as well, despite the repressive political circumstances in 
the late 1970s and the 1980s, which meant that taking photos documenting 
members of underground and dissident groups was a rather risky business. 

One way to make network graphs more legible to humanities scholars is to 
combine graphs with other types of diagrams with which they may be more 
familiar.'* Palladio can be used to make maps, galleries and tables which can 
complement network graphs.'* In this case, we can use the same dataset to 
produce a map which shows the approximate number of photos taken at each 
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location. In order to obtain geographical data, we processed the keyword entries 
and collected all recognised geographical names and sub-regional information 
like street names or city districts which could be used to determine the city 
(for example, Berlin, Jena, Bad Frankenhausen). Almost three-quarters of the 
photographs included in our analysis could be connected with a geographical 
location. In the last step, we automatically geocoded these locations, finding 
latitude/longitude coordinates for all recognised locations for purposes of geo- 
spatial analysis. Figure 13.3 shows the locations of the photographs of individu- 
als associated with Jena; the size of the circles represents the number of photos 
at that location. The green circles represent photos taken in East Germany and 
the red circles represent photos taken in West Berlin. 

It is apparent in Figure 13.3 that Berlin and Jena are the most important 
regions and are strongly linked. We know from analysing the temporal distri- 
bution of these photographs that Jena remains the most important region until 
1984, thus confirming previous studies stressing the importance of the Jena 
region for the East German dissident community. From 1985 onwards, Berlin 
gains in importance and becomes the most frequently referenced region in our 
data. This change is well in line with the overall course of events during the 
second half of the 1980s. A good example of how this change is connected to 
specific places are photographs referring to the Umweltbibliothek (Environment 
Library) in East Berlin. The library was founded in 1986 in the cellar rooms of 
the Zionsgemeinde and rapidly became one of the central communities of the 
East German dissident movement. What is not so clearly visible in the network 
graph (Figure 13.2) or the map (Figure 13.3) but evidenced by the dataset itself 
is that the core social network of East German opposition was rather small, 
revolving around certain key figures who were rather mobile. These central 
figures account for connections to some of the smaller towns. A good example 
of this phenomenon are the towns Fürstenwalde (Spree) and Grüneheide, both 
loosely connected to Robert Havemann (1910-1982), an intellectual and dis- 
sident sentenced to house arrest in 1976. A remarkable portion of photographs 
taken between 1975 and the early 1980s document Havemanns life under 
house arrest and the people visiting him. 


Freezing Networks: Snapshots in Time and Local Networks 


Just as giving researchers a view of the whole is important, it is helpful to 
visualise the network in specific places and at specific times. We can do that 
easily by filtering by place in Palladio and visualising only the nodes and edges 
associated with one place. Figure 13.4 shows the Jena-associated people who 
appear in photographs taken in West Berlin. West Berlin is a locus for indi- 
viduals who were deported from East Germany but continued to be of interest 
to the East German government, such as Jürgen Fuchs and Roland Jahn. We 
can see that the network in West Berlin is much smaller, but still contains very 
important figures. 
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Figure 13.3: Map of photographs of GDR dissidents linked to Jena, 1975-1990, 


basemap. Courtesy of the David Rumsey Collection. Source: Authors; 
basemap Haack 1965. 


Roland Jahn was an active member in the Jena dissident community, who 
engaged himself as a young university student in protest actions from the 
mid-1970s onwards. Jahn was ex-matriculated from the University of Jena in 
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Figure 13.4: Network graph of GDR dissidents linked to Jena in photographs 
taken in West Berlin, 1975-1990. Source: Authors; data from the Robert- 
Havemann-Gesellschaft. 


1977, arrested again in the early 1980s, and finally expelled to West Germany 
in May 1983. However, even after his expulsion, Jahn continued to support 
the Fast German opposition movement. Järgen Fuchs, in turn, had already 
been expelled to West Germany in 1977, but he remained involved in the Fast 
German dissident community until 1990, especially via Lutz Rathenow, another 
strong figure in the network visualised in Figure 13.4." Figure 13.4 shows that 
the networks of Fuchs and Jahn were, indeed, interconnected and formed the 
core of the network in West Berlin. There are other sub-networks (for example, 
the sub-network without a clear hub) of which Lilo Fuchs, Wolfgang Diete and 
Lutz Leibner were members. Despite appearing in a similar number of photo- 
graphs as Jahn, Fuchs connects more individuals and more disparate parts of 
the network. 

Similarly, making network graphs of specific moments, or 'snapshots; of the 
network at various times can be highly informative in understanding the evolu- 
tion of the network and disentangling connections made in different periods.'* 
Whereas Figure 13.2 is a static network that displays all of the nodes and con- 
nections in the period from 1975 to 1990, Figure 13.5 shows only the nodes and 
edges present in 1981. Tom Sello, the construction worker from Großenhain, is 
central to one network here, despite his relative youth at 24 years of age. 

The dataset was filtered to the year 1981 to produce Figure 13.5; the graph 
shows two clusters of dissidents and their associates who appear in photos 
together. The first cluster is centred on Tom Sello and six other people in Säch- 
sische Schweiz. The second is a larger set of individuals in a tightly clustered 
network located in Woltersdorf and centred on the trio of Reinhard Weißhuhn, 
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Figure 13.5: Network graph of GDR dissidents linked to Jena in photographs 
taken in 1981. Source: Authors; data from the Robert- Havemann-Gesellschaft. 


Lutz Rathenow and Elke Erb. The high clustering shows that this group often 
appeared in many photos together. In fact, you may recognise this horizon- 
tal network of people with many connections to one another from the photo 
of the reading at Ulrike and Gerd Poppe’s house on 27 June 1981 (Figure 13.1). 
This group appears in multiple photos together at the Poppe residence, which 
accounts for their high degree of association in the data. Unlike the photos of 
Tom Sello and his associates, these photos document a large and important 
gathering of leading figures from the movement in the same space. 


Clarifying and Simplifying Network Graphs 


Another way of cleaning up network graphs to make them more legible is to 
remove nodes that fall below a certain threshold in connection to the core of 
the network. In the case of Figure 13.2, we observed that the giant component 
comprised the majority of the nodes and edges, but was difficult to analyse due 
to its density. In order to focus on this giant component, which forms the core 
of the network, we removed the nodes which were not connected to the giant 
component and then graphed that network using Gephi. Figure 13.6 shows this 
network core. 

By focusing only on the central component of the network, we can make out 
more of the network structure. For instance, we can see which nodes function 
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Figure 13.6: Network graph of GDR dissidents linked to Jena in photographs, 
1975-1990. Source: Authors; data from the Robert-Havemann-Gesellschaft. 


as hubs, having a number of connections that far exceeds the average number 
per node; we can also see which nodes function as connectors, tying together 
different sub-networks. Within this network, the centrality of Roland Jahn 
derives from the high number of individuals with whom he appears in photos, 
which makes him a hub; it is also significant that he connects the various com- 
munities, or cliques, we see in this graph. Other hubs include Bárbel Bohley, 
Mattias Domaschk and Carlo Jordan, all of whom appear in a large number of 
photos with other individuals. Many of the individuals who appear in photos 
with Domaschk and Jordan do not appear in photos with many (or any) oth- 
ers. Mattias Domaschk was active in Jena, including in the Junge Gemeinde 
Jena-Stadtmitte, but his contacts mostly only appeared in photos with him. 
Jordan was primarily active in the Berlin region, where he went on to become 
a leader in the green movement. Jürgen Fuchs and Tom Sello both have signifi- 
cant numbers of connections with otherwise isolated nodes, but they also have 
more connections to the broader network. 

Comparing these observations to our lists of the presumed core mem- 
bers of the Jena group, especially Matthias Domaschk, Jürgen Fuchs, Barbel 
Bohley and Tom Sello (see above), we can see the group was strongly connected 
through Roland Jahn, at least in terms of co-appearances in photographs. We 
can also readily observe that the women in the group, specifically Katja Have- 
mann, Bettina Wegner and Petra Falkenberg, play less of a central role in tying 
together the network of photographic co-occurrences than do the major male 
figures. The relative absence of female figures from the core of the network 
makes the centrality of the artist Bärbel Bohley that much more striking, 
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especially in the latter years when she is highly central. Figures like Mattias 
Domaschk were photographed with others in the opposition movement who 
did not appear elsewhere in photos, much as we saw earlier with Tom Sello, 
suggesting that he was bringing many people into the movement or associating 
with non-opponents of the regime. On the other hand, Bárbel Bohley is part of 
a sub-network that has many more co-occurrences among its members and has 
no clear hub. Again, it is worth investigating whether this photographic pattern 
has any reflection in real-world relationships or if it is an artifact of the photo- 
graphic record, as was the case with the clique centred on the Poppe residence 
in Woltersdorf. Jürgen Fuchs, Carlo Jordan and Tom Sello are each connected 
to their own small ‘clique’ of people together with whom they appear in pho- 
tos, although these cliques are more cut off from other sub-networks, as well 
as being smaller. The most significant mystery remains why the sub-network 
containing Bärbel Bohley is more diffuse and "leaderless than the rest of the 
sub-networks. When we examine the photos in which she appears, we see that 
she appears in many photos with crowds and figures who are not otherwise 
in the database, including, for example, the Dalai Lama. It would appear that 
Bohley brought many new people into the movement and connected dissidents 
to a broader world of activists than earlier ‘hubs. 


Conclusions 


HNA can make use of the tools of SNA to understand the structure of social 
relationships, whether in smaller networks or at scale. While some of the tried 
and tested techniques, like large 'hairball graphs, may be limited in their use 
value for historians, network graphs which reflect real research questions, such 
as the shape and size of a network in a particular year or in a particular place, 
can easily be created when one is familiar with how information has been cap- 
tured and structured.” These graphs are not mere illustrations of previously 
known relations, but a way of exploring segments of a large dataset to find new 
patterns and new questions: how, for example, the GDR opposition movement 
changed in relation to events like deportation to Berlin or the release of a mem- 
ber from detention. We have seen that the changes in the network following 
such events are not always predictable, such as when the movement held a very 
large and semi-public reading at Woltersdorf following Havemanns release or 
when Bärbel Bohley’s very public events sometimes led to photographs depict- 
ing fewer documented members of the movement. For this reason, it is fun- 
damental to consider what the graph is depicting (in this case appearances 
together in photos) and not to use the graphs in a naive fashion to represent all 
real connections between individuals. 

This chapter sought to exemplify how HNA could be used to explore and ana- 
lyse personal and geospatial connections and ties behind a real historical phe- 
nomenon, the East German dissident movement. The networks reconstructed 
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and analysed in this chapter fit quite well with the historical facts about the East 
German opposition, although we need to note that we cannot estimate how 
much data is missing or whether a larger sample would have affected our results 
differently. Nor can we know to what extent patterns in the photographic co- 
occurrences reflect relationships outside of photographs. Network analysis is 
known to be quite sensitive to missing data,” but applying network analysis 
can reveal discrepancies, such as when a group gains members or loses them, or 
when the structure ofa group changes from tighter to looser, as the Jena-linked 
dissidents often did when they were deported or arrested. Network analysis can 
also reveal when group members like Tom Sello or Bárbel Bohley introduce 
many new people to the network. Despite these pitfalls, we are hopeful that 
our chapter could convince its reader that HNA could help scholars to gain 
new insights into the way in which social networks reflect political pressures 
in sometimes unexpected ways (for example, by growing or becoming more 
public in response to crackdowns). Although not impossible, such insights can 
be difficult to gain through the traditional methods of historical research. 


Notes 


' Conroy 2019. 

? Robert-Havemann-Gesellschaft e.V., Archiv der DDR-Opposition, Bildarchiv. 

3 The security authorities’ operation ‘Counter-strike’ is documented in 
BStU 2013. 

1 See, e.g., Schroeder 1998: 233ff; Gieseke 2008. 

> Veen 2000: 27-29. 

° Neubert 1998: 488. 

? Matthias Domaschk was a young political activist, who died on 12 April 
1981 in Gera in a pre-trial detention of the East German security service 
after 13 hours of continuous interrogations. Tom Sello, in turn, engaged 
himself in several dissident groups in the GDR, especially in the 1980s. He 
also wrote for several underground publications (Samisdat) and was repeat- 
edly attacked by the security service. Bettina Wegner was an East German 
songwriter and lyricist. In 1983, she was threatened with prison and forced 
to leave the GDR for West Berlin. Gerd Poppe was a political activist who 
fought for human rights in the GDR. He was also actively engaged in the 
publication and dissemination of several illegal underground publica- 
tions (Samisdat). Poppe was subject to the Stasi's intensive observation and 
repressive activities. Barbel Bohley was an East German opposition activist 
and artist. She was one of the co-founders of the Initiative for Peace and 
Human Rights (1985) and of Neues Forum (1989). For a detailed descrip- 
tion ofthe other people mentioned, see Elo 2018. 

* Fora history of the development of SNA within the social sciences, see Prell 
2012: 19-50. 
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? On how HNA differs from SNA more generally, see Düring and von 
Keyserlingk 2015. 

? Lemercier 2015. 

! Ibid. 

? Morrissey 2015: 69-70. 

? Nocaj, Ortmann & Brandes 2015. 

^ Some of the most commonly used are Gephi, Cytoscape and R. 

5 Conroy et al. 2020. 

'* Conroy 2019. 

7 See further Elo 2018; Neubert 1998. 

35 Conroy et al. 2020. 

? Drucker 2011. 

2 Wetherell 1998: 125. 
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CHAPTER 14 


The Many Ways to Talk about the 
Transits of Venus 


Astronomical Discourses in Philosophical 
Transactions, 1753-1777 


Reetta Sippola 


A Popular Astronomical Event 


In the 1760s, one of astronomy’s rarest predictable phenomena, the so-called 
Transit of Venus, was calculated to take place twice: in 1761 and in 1769. This 
phenomenon, when the planet Venus passes across the Sun, from the Earth's 
vantage point, was not only extremely rare, as the previous transit had taken 
place in 1639 and the next was to follow in 1874, but also very valuable scien- 
tifically, as observing this kind of transit would make it possible to determine 
the distance between the Earth and the Sun more accurately than before. This 
could in turn make it easier to improve a number of practical issues relying 
on astronomical knowledge, foremost among them to improve the accuracy 
of calculating locations at sea, which at this time was at best inaccurate, often 
resulting in costly and deadly accidents. Thus, the two Transit of Venus events 
and the astronomical information that could be derived from observing them 
enjoyed wide interest among both scientific professionals and the general 
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public. The scientific interest in the transits during the 18th century was rep- 
resented through a large number of news items and scientific reports in the 
scientific literature, especially in scientific periodicals, such as the Philosophi- 
cal Transactions of the Royal Society of London. In short, there was a large and 
varied scientific discourse talking about the Transits of Venus. This chapter 
explores what new historical knowledge about science in the 18th century we 
can derive from using digital methods to study such scientific discourse related 
to a particular scientific phenomenon. 

However, alongside the natural philosophical news and reports, many 
broader perspectives towards early modern knowledge and the early modern 
world itself were also communicated in the scientific journals. By using digi- 
tal history methodologies to analyse the qualitative meanings and quantitative 
amounts of the common topics and themes in a scientific periodical, I suggest 
in this chapter that there were simultaneously nine different ways of talking, or 
discourses, about astronomy and the two Transits of Venus of the 1760s. 

Astronomy has been one of the main ways for scientists to explore expla- 
nations about our place in the universe, but astronomical knowledge has also 
been used for other more practical applications in economy, politics and trans- 
portation. This combination of pursuing natural philosophy endeavours both 
for knowledge and for practical applications was also central for the Royal 
Society of London, established in 1660 in England for improving the knowl- 
edge of nature and mankind, as articulated in their full official name. To serve 
astronomical practice, the Society funded two large expeditions to make obser- 
vations of the 1760s Transits of Venus, but at the same time and partly through 
these expeditions they also took part in the larger and wider transformation of 
how the natural sciences were understood in the 17th and 18th centuries, by 
creating a communal and public space for circulating the new knowledge. The 
Society circulated a newsletter that disseminated and shared the new scientific 
information coming from collecting and observing abroad on scientific voy- 
ages and commercial encounters, as well as through experiments at home. The 
letter soon turned into the form of a periodical, entitled Philosophical Trans- 
actions Giving some Account of the present Undertakings, Studies, and Labours 
of the Ingenious in many considerable parts of the World, established in 1665 
and published as the Society's official journal from 1752. From the mid-18th 
century onwards, Philosophical Transactions was a journal publishing scien- 
tific correspondence that had been selected and reviewed by the fellows of the 
Society. The journal also had a much wider readership than just natural philos- 
ophers.! Its topics included news about the latest innovations and discoveries 
and reports on geography, natural scientific specimens and natural and man- 
made phenomena, such as weather and electricity. Even the work of amateur 
experimenters was published, whereas many professional reports sent to the 
Royal Society were rejected and put to one side with a note ‘not to be printed? 

In the mid-1700s, many nations participated in a race for the knowledge, pres- 
tige and power that could be gained through successful observations of the two 
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predicted Transits of Venus. National and international networks of knowledge 
invested a significant amount in astronomy research, and the importance of 
succeeding in observing the Transits of Venus has often been compared to the 
Cold War space race of the 20th century. Central in this were the observation 
instruments which were carefully collected and arranged for transportation to 
judiciously planned observation locations.’ The Transits of Venus were then 
widely discussed in various contemporary printed periodicals, as well as in the 
general press. It was understandably a prominent topic in periodicals such as 
Philosophical Transactions reporting on the latest news within natural philoso- 
phy. The emphasis was on the conducted experiments and the reporting and 
theorising of the events. However, at the same time, the Royal Society’s public 
communications also conveyed other more elusive and implied perspectives 
on the philosophical inquiries to their wide group of readers. In this light, this 
chapter uses the reporting of the Transits of Venus in Philosophical Transac- 
tions to critically discuss what kind of view of ‘scientific inquiry was offered 
to its readers: What was actually discussed about astronomy in particular and 
the new science in general among the texts that succeeded in being printed in 
Philosophical Transactions during the late 18th century? The guiding hypoth- 
eses behind the study is that the general values of 18th-century British society 
and the philosophical environment have been repeated directly and indirectly 
in the texts. By re-reading this discourse with the digital methods, we can access 
the underlying thoughts, paradigms (such as a shared model of the universe) 
and practices related to experimental science that existed at this time. 


Topic Modelling the Publication of New Science 


The historical meanings of the editing and peer reviews of Philosophical Trans- 
actions has been studied by Ellen Valle in particular, and subsequently by Julie 
McDougall-Waters, Noah Moxham and Aileen Fyfe.* My research continues 
their work by applying the use of ‘machine reading’ and the method of topic 
modelling for finding the underlying patterns and values that have influenced 
the public writing about new scientific knowledge. By comparing statisti- 
cal amounts of various topics in Philosophical Transactions at different times, 
topic modelling reveals temporal changes in scientific approaches in the many 
ways of talking about the process and of the degrees of certainty within the 
emergent sciences. This could be a major contribution to the history of early 
scientific communications because this kind of temporal change has not been 
discussed in the earlier scholarship, which has only applied manual reading of 
these resources. The digital methodology of topic modelling provided a way to 
examine the large amount of texts in Philosophical Transactions and to reveal 
the changing patterns in the ways in which the two transits were discussed. It 
located the internal relations inside the ‘big data, comprising all the words in 
the texts, and teased out their shared and underlying meanings. 


240 Digital Histories 


I have focused on 1753-1777 as a period of 25 years containing the important 
astronomical events (the two Transits of Venus in 1761 and 1769). This period 
also covers the time after the Society had formally taken over the journal in 
1752 and begun editing the articles communally via the so-called Committee 
of Papers, and before it changed into a new period of knowledge circulation in 
1778 when Joseph Banks took on the presidency of the Royal Society. The time 
frame also contains the collection of new data during world-changing voyages 
of exploration, such as those of Captain Wallis in 1765-1768 and Captain Cook 
in 1768-1780, both supervised by the Royal Society. 

The selected corpus of texts from Philosophical Transactions consists of a 
set of 1,421 documents relating to the transits, which represents a collective 
public discourse on scientific discoveries, innovations and experimental rou- 
tine (a typical genre of texts published in the journal). The texts of the corpus 
were generated by optical character recognition (OCR) of a digitised version 
of Philosophical Transactions provided for research purposes by JStor, and I 
used a temporal selection of the Royal Society Corpus? which had been collected 
and pre-processed in a previous linguistics project headed by Elke Teich.‘ Their 
pre-processing included the transformation of data into a standardised format, 
cleaning of data (for example, OCR errors) and derivation and annotation 
of metadata.’ 

The research process consisted of three stages, which combined statistical 
and computational quantitative methods with qualitative analysis of the texts. 
During the first stage, I applied topic modelling to the corpus to create lists of 
probable keywords describing the themes existing within the data corpus. To 
operate with the topic modelling algorithm, I prepared a so-called ‘stop list’ of 
very common words (such as the, is, as, etc., as well as prepositions and con- 
junctions) to be filtered out and excluded in the analysis of the corpus. I also 
arranged the processing of the corpus temporally by dividing it into five-year 
sets (1753-1757, 1758-1762, 1763-1767, 1768-1772, 1773-1777) and then 
ran these through the topic modelling algorithm implemented in the MAL- 
LET software application. The output was a list of various topics or themes of 
interconnected keywords co-occurring throughout the articles. As a prelimi- 
nary analysis, I used MALLET to produce a varying number of topics from the 
corpus and concluded that the best and most realistic fit, in terms of what from 
the keywords appeared to be relevant and meaningful topics that were not too 
general or too narrow, was when the number of topics was 50. Following this, 
I grouped the topics along semantic similarities in their keywords in order to 
locate their relations to the scientific contexts in which the original texts of the 
corpus were created. This was a crucial part of the research; while a ‘topic’ to 
the computer is merely a list of words that occur together in statistically mean- 
ingful ways; in the following manual analyses by the researcher these lists are 
shown as a semantically meaningful string of keywords which need to be sewn 
together with a larger thread of historical context and interpreted in order to 
provide a meaningful label that refers back to the historical context. Therefore, 
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the historical work of contextualising the English 1700s science discourse was 
ongoing throughout the entire research process. 

In summary of the above, I programmed MALLET to create a list of 50 top- 
ics, each consisting of 20 keywords, and entered these into a spreadsheet which 
showed the number of these topics during the chosen years. The keywords 
represent co-occurrence patterns in the data corpus and probabilistic appear- 
ances of a particular theme. I located keywords that signified general trends 
and particular themes, such as those that concerned instruments in observa- 
tion and experiments. After examining and labelling all the topics which were 
present in 2% or more of the text mass, I grouped the topics according to their 
semantic similarities. This revealed that the common topics in the data could 
be roughly grouped according to ‘polite correspondence style, ‘astronomy, 
‘chemistry; ‘weather’ and ‘instruments and their use. In actual fact, it was not 
until this stage that the processing of the data highlighted the most fruitful 
research questions warranting further investigation. 

As mentioned above, Teich’s linguistics team had already carried out research 
that applied topic modelling to the first two centuries of Philosophical Transac- 
tions. Their visualisations enable us to see topical trends in the corpus, in par- 
ticular in terms of discipline formation and specialisation, that there is a growing 
separation of individual scientific disciplines over time and that the discourse 
became increasingly specific over time.’ Aiming at different perspectives on the 
proposed material and research questions, I thus designed the commands and 
continued to fine-tune my topic modelling less towards general themes stretch- 
ing over large time periods, and more towards more specific themes connected. 
to my more focused period and scientific developments. As a result, this is an 
example not only of applying a fresh perspective on an already known resource, 
but, importantly, also of a collaboration among experts which is characteristic 
for contemporary digital humanities.'° Teich's team identified 24 different top- 
ics in Philosophical Transactions which indicated the development of scientific 
sub-disciplines (including Chemistry, Mechanics and Reproduction, among 
others). This research was an excellent starting point for my inquiries, in which 
I used an alternative approach to examine the periodical and which found 25 
topics that were theoretically visible, as they each referred to more than 1% of 
the amount of text. Our two approaches to the same data differ in the close 
reading perspectives, and the comparison has led me to notice other wider per- 
spectives. In the following section, I will problematise what else can be depicted 
from the results of a computer-assisted distant reading of the periodical from 
the perspective of the cultural history of science. 


Thematic Trends of the Discussion 


When beginning to discuss the quantitative results of the topic modelling, it 
is indeed no surprise that the keywords and their frequencies show a nota- 
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ble rise in astronomical topics around both transits. They are visible during 
most years as a relatively large number of texts on astronomy, with the text 
mining revealing keywords such as ‘transit, June’ and “Venus, where June’ was 
the month when the 1761 transit occurred. Astronomy and its related prac- 
tices and techniques were commonly mentioned, but in the researched time 
span, Philosophical Transactions also discussed a few non-astronomical topics. 
However, these also helped to locate relevant aspects of the scientific ways of 
talking also within astronomy, as I will demonstrate below when discussing the 
case of systematising observation results. In particular, these non-astronomical 
topics have a specific value as they can vividly illuminate what is not part of 
other discussions. 


Shared discourses, split viewpoints 


The remainder of this chapter will analyse the astronomical ways of talking by 
contextualising and connecting the prevalent topics in the published scientific 
communications (for example, in Philosophical Transactions) to the making of 
early modern knowledge. I begin with a thematic analysis of the text-mining 
results and then conclude with a summary of the approaches and temporal 
appearances of the nine ways of talking about science. Keywords of all of the 
discussed topics are listed in Appendix 14.1.” 

The data shows that the discourse on the first transit (1761) circulated 
around various viewpoints concerning reliability and the structure of the solar 
system. The common themes consisted of exactness, measuring, relating and 
belonging (to the universe), which also reflect the excitement of using the 
enthusiastically approved, state-of-the-art instruments in making observations 
of the first transit. Instead of describing causal chains, the observations were 
communicated as indicating the values of various connecting sets of rules and 
theories. The observations were thus not neutral, but became charged with 
rules and theories. 

The lists of observation data were apparently published as they had been 
recorded, and were then consequently explained through the use of mathe- 
matical algebraic reasoning (the topic ‘Astronomical distance’ with 11%), as a 
system of heavens that could be revealed (‘Rules of the stars’ topic with 5%) 
and how it related to the solar system (Appliances topic with 8%). The first 
two topics were only present in 1758-1762, while ‘Solar system’ was an ongoing 
topic, although notably only at the beginning, as it almost disappeared (1% to 
2%) around the second transit. In 1763-1767, the theme seems to split into 
two topics, as a similar topic ‘Heavenly bodies’ shortly appears at this point. 
However, it differs somewhat, as this topic focuses on the planetary system and 
its parts (which are described in a similar manner as the reports would talk 
about the human body, possibly in a ‘plain’ style), while the topic ‘Solar sys- 
tem’ discusses what can be achieved on Earth by using those heavenly bodies: 
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for example, ‘latitude’ and ‘degree’ are among its keywords, which could often 
signify the calculation of the location at sea. These distinctions already show 
that the discourse on the transits addressed many separate views on the topic, 
varying talk which not only promoted causal theories and mathematical rea- 
soning or arranging raw data ‘plainly, but also offered discussions on more 
social factors, such as the extent of trust and belief that could be placed in astro- 
nomical calculations. 

There might be some skewing of this topic, as a large number of its key- 
words might be brought about by one single lengthy paper (23 pages in print) 
in Philosophical Transactions by the Astronomer Royal Nevil Maskelyne, on 
observations and methodological rules he developed following the 1761 tran- 
sit. By collecting and comparing the data received from various observations at 
St. Helena about the Transit of Venus, Maskelyne formulated a set of theoretical 
rules about the skies and the use of astronomical instruments. During this 
time, if the transaction discourse is representative, apparently a rather small 
part (‘Rules of the stars’ topic with 5%) of the astronomical interest concerned 
astronomy's theoretical system. This theoretical discourse topic also existed at 
the time of the second transit in 1769, meaning that Maskelyne’s ideas con- 
tinued to be referred to or discussed after his paper’s initial 1761 publication. 
However, as the topic decreased from 5% to 2%, this indicates an increasing 
emphasis on other perspectives, such as a dispute on the relative importance 
of causal versus mathematical evidence. That Maskelyne’s article constituted a 
dominant part of the topic could easily be identified through a keyword search 
in the dataset, and it is worth emphasising that it is both important as well as 
useful to also apply close reading to enable closer study and control the content 
of themes that arise from the distant reading of the text-mining results. 

One alternative to using theories and mathematics to explain astronomical 
measurements was through empirical means (namely, repeated experiments or 
observations). This is demonstrated by the topic ‘Distance measuring process’ 
being continuously very strong (9-12%) throughout the period, thus repre- 
senting scientists’ interest in the general process of experimentation, while the 
keywords of the ‘Astronomical distance’ topic point towards explanations and 
mechanisms based on the laws of physics, physical circumstances and entities, 
such as velocity and distance. Apart from ‘glass’ (a reference to a material most 
likely indicating the lens of a telescope or an experiment tube), the keywords of 
the ‘Distance measurement process’ topic ( distance, ‘degrees; ‘line’ and ‘places’) 
primarily point towards intangible concepts connected to very tangible experi- 
ments, of measuring and of practically producing new observational data. 
Interestingly, its keywords of ‘appearance; ‘means’ and ‘method’ give this dis- 
cussion various kinds of unsure, unsettled and dynamic connotations and, fur- 
thermore, the topic contained discussions about particular location-specific, 
probable or ‘apparent’ aspects. 

The topic ‘Astronomical distance’ was discussed only once, in the 1758-1762 
segment, with a sudden 11% spike, in which the observations of the heavens 
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(and the astronomical signified keywords of [moon/sun’s] limb, sun, star and 
wheel) were linked with the data in the tables (measure, feet, foot, inches) 
and algebra (cos, distance, wheel). This and the ‘Distance measurement 
experiments’ topic both mention the word ‘distance’ but the other keywords 
indicate a clear difference in meaning between the two, as the topics reveal 
how ‘distance’ was to be observed, depending on whether it addressed celestial 
distances or distances on Earth. In comparing these two topics, I claim that the 
topic modelling algorithm has revealed a very important temporal change: it 
has located a mainly shared discourse about the practices of deciding on the 
distance, but it has split this theme into two. Their separate keyword lists ini- 
tially differ between practical experimentality and understanding the physical 
theories of the phenomena, but after this, both topics also share similar pas- 
sages about the use of instruments and theorising the measurements. Both 
topics discuss various means to measure distance, and the shared keywords of 
measure, distance and [the method of heaven's] wheel, which all appear in both 
of the discussed topics, capture a broader, shared discussion. This is the value 
of topic modelling: by using machine-reading and locating shared but slightly 
variable topics, the nuances of the discussions can better come to light. 

The examination of these topics indicates that there were two profoundly 
different ways of what was considered to be the reliable way to find data. The 
first transit in particular allowed plenty of interest (11%) in making arguments 
through calculations, but for the second transit the frequencies indicate that the 
mathematics were no longer a primary worry. There is only a short temporal 
change within the statistics that could, however, exemplify a dramatic change in 
the paradigm. It appears that a more general perspective about measurements 
was developing and that means other than mathematical results had become 
increasingly relevant in the early modern search for reliability and ‘truth. 

Following Newton’s formulation of mathematical laws of nature, in the 18th 
century it was usually considered sufficient to settle an astronomical dispute 
if one could arrive at a successful set of mathematical calculations.” At a time 
when logical causalities became less interesting to the scientists, the large 
number of words indicating probability in my data raises questions about 
reliability, the meanings of eye-witnessing and the capabilities of the human 
mind and senses. These topics can also indicate that there was a split discourse 
according to the two schools of realists: one consisting of mathematical realists, 
including those who thought mathematics could provide the real motions of 
celestial bodies; the other comprising the group of physical realists who held 
that a mathematical model of the real structure of the heavens should be based 
on physics.!* 

Another example of transitions within the paradigm can be noted in the texts 
written in Latin. A few data and observation reports were received and pub- 
lished in Latin, including the transit observations from Uppsala in Sweden.” 
However, this kind of discourse was narrow and concentrated into two top- 
ics: 'Solar system Latin' and "Transit and Latin: The two topics were both most 
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popular around the first transit, and while they both circulated the calculations 
for longitude, they differed in their perspectives. The first described the obser- 
vations plainly, while the other seems to have emphasised the materiality of the 
observation process. This means that in Latin there were at the time in question 
two ways of discussing astronomy: first, as descriptions of the solar system; 
and, second, in the form of the actual process of making observations of the 
solar system. It also appears that during this period both Latin topics started 
to disappear. This decline of Latin had also been noted in passing by the previ- 
ously mentioned linguistics researchers when stating that some non-thematic 
topics, including texts written in Latin, reached their peak in the early 18th 
century.'* However, as we can observe in the keyword list, Latin was still used 
in the late 18th century in publishing accounts concerning the heavens and 
the solar system. The topic ‘Solar system Latin’ was popular between 1758 
and 1767, with a 6% to 8% share of all discussions, but then almost disappeared 
(1% to 2% in the subsequent decade), while the “Transit and Latin' topic was 
visible throughout the entire transit decade, but only with a 2% to 5% share. 

The host of Latin words in ‘Transit and Latin, ‘vero; ‘inter’, ‘hoc, ‘enim’ and 
"hujus, are all abstract, frequent words. ‘Vero is the fascinating one, as it means 
‘truly, indeed, to be sure, certainly: In fact, among the 500 keywords in the 
other topics, there are many that indicate a continuous interest towards 
the ‘true, such as in addition to ‘true, also ‘purpose’ ‘order’ ‘error’, anemones, 
‘effect, ‘probability’ and ‘rules: And deriving from the keywords ‘apparent’ and 
‘error, the “Telescope observations’ topic can be seen as promoting an openness 
towards the meanings of the results and how changing circumstances could 
affect the observations, although the task in question seemed to be regarding 
some precise measurements, including ‘minutes’ and ‘seconds: 

Taken together, these topics discuss a contemporary science culture that 
was seen as being open-ended and constantly recreated by various actors." In 
particular, the certainty of probabilities and the transparency about practical 
choices when making observations seem to be continuing themes in the vari- 
ous discourses. 


Talking about the weather 


The external conditions of the observations were and are a central issue to 
astronomers. It especially appears to be the main way to talk about the astro- 
nomical observations in one topic which, due to its huge number of weather 
keywords, has been labelled ‘Astronomer’s weather. This was a strong topic 
during its time: although it only existed during the 1768-1772 period, the 
topic arrived with a notable 14% of the volume of text published in Philosophi- 
cal Transactions. The first words in this topic are ‘ditto’ ‘air’ and ‘limb [of the 
moon or the sun] which apparently refer to astronomical observation tables. 
These keywords were also accompanied by June’ and by as many as five words 
describing weather. 
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Throughout the years in question, the weather was continuously discussed, 
but there is some increase in the latter 10 years of the studied period, during 
which the second Transit of Venus took place. This sudden rise in the topic 
modelling chart is understandable in the light of the unfortunate event of 
not having been able to make successful observations of the transit on the 
first attempt in 1761 due to difficult weather. It is therefore no surprise that 
we can find some discussion about the conditions already before and after the 
1761 transit: it was anticipated that clouds could obscure visibility at the vital 
moment, as they had so often done when Edmond Halley had been observing 
a similar transit almost a century earlier." The unpredictability of weather was 
just one of the several reasons why there was a need to simultaneously observe 
the Transit of Venus from many different and varied places on Earth. Beside 
cloud cover, Halley had already in 1716 pointed out in his observation advice 
the importance of arranging coverage from northern and southern latitudes, 
as both the ingress and the egress could not be observed from all the loca- 
tions. Thus, to avoid a similar fate, a large-scale operation was connected to the 
Transits of Venus in order to coordinate several international observations and, 
as had been suggested by Halley, to arrange multiple observatory tents to be 
set up around the globe by British, Austrian and French observers.? The more 
observations that were made from many different and widely separated vantage 
points, the more accurate the ultimate results were likely to be.” 

Weather was also employed in other discourses. In the topics 'Environmental 
circumstances’ and “Travel narrative, weather figures in the form of the blame 
and complaints that often figured in early modern observation accounts. These 
often served very practical purposes, as a failure to conduct an experiment 
could often be more easily accepted due to bad weather. Simultaneously, these 
topics address the practice and materiality of the observation process in the 
form of broken instruments, successful delivery of the instruments and other 
aspects affected by bad weather. In this, the topics provide valuable tangible 
insights into early modern scientific practice. 

The topics concerning the weather were used in particular around the sec- 
ond transit (1769). The topic "Telescope observations’ appears to mention the 
weather during both transits, but only moderately, with a few percent (296 to 
496 during the 1758-1772 period), while the main emphasis is on discussing 
the way in which the new more advanced instruments could create or dimin- 
ish the observation errors. Nevil Maskelyne wrote a letter to the Royal Soci- 
ety in 1761 which discussed at length various contextual influences on the use 
of the instruments. Before listing the transit observation results, Maskelyne 
wrote about ‘the observations themselves, and mention[ed] some cautions 
concerning them; which among others included the exact practical adjustment 
of the quadrant (a navigational instrument used for angle measurements).”*”° 
The context in which the instruments were used, then, was important to the 
narrative and the manner in which they were discussed. 
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On the other hand, the failure to produce consistent and certain results dur- 
ing the scientific voyages of observation connected to the 1760s transits has 
been blamed not just on bad weather, but also on other external or material 
conditions that affected the voyage, such as war, illness and, above all, inex- 
perience in observing the phenomenon in question.” The quantitative dis- 
tant reading of results through topic modelling provides a new opportunity 
to investigate how strong the pattern has been to make arguments about the 
failure (and at times success) of experiments through referring to external cir- 
cumstances. Such talk about instruments and circumstances that affected the 
observation process attracted a wider group of readers among scientists besides 
astronomers. In the texts, therefore, such talk can also be seen as functioning 
as a way of elevating the experimenter’s status as a scientist, or as mirroring the 
scientist's thoughts or search for reinforcement from other scientists; namely, to 
write down the observations in a punctual manner was entangled with a deeper 
purpose that made visible the discourse of complimenting others or showing 
evidence of their mistakes and failures. Reid even called this kind of performa- 
tive behaviour ‘playing the astronomer, as the astronomer was in a precarious 
position on an expensive voyage as he was likely the first one among the crew 
to be seen as less important and as excess weight." 

In this light, it is interesting to examine how instruments and bodies were 
referred to, as this reflects the general values that kept being repeated in the 
accounts. Simon Schaffer has suggested that the states of disrepair in observa- 
tion reports or travel narratives refer simultaneously to the tools and to the 
humans that interact with them and with one another. As he wrote, the states 
of disrepair helped distribute responsibility across cultures and spaces, offering 
resources to defend some reputations and damn others.? From this perspec- 
tive, the referral to the external conditions and the others who contributed in 
the making ofthe observations were part of the pattern of this way of distribut- 
ing the responsibility for scientific success and failure. 

Referring to the sources and collaborators was also a matter of reliabil- 
ity. According to the semantic arrangement of the keywords in the topics 
'Astronomers weather, ‘Environmental circumstances and “Travel narra- 
tive; the materiality of objects was closely linked with the circumstances of 
observing. Maskelyne’s previously mentioned 1761 lengthy report was careful 
to mention and honour the builders and senders of the equipment,” and in a 
similar fashion while on a state-sponsored exploration voyage, another astron- 
omer in his journal listed the people who had built or fixed the instruments 
he used.’ If no observation was possible due to fog or a broken instrument, 
their role in the experiment or its possible failure was at least taken into 
consideration in the description of the event. Hence, the keyword ‘wood’ is 
interesting. This might indicate the 'honouring often offered the makers of the 
instruments used in the observations, as such compliments were written with 
full description of the types of materials used in building them and thus the 
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quantitative data also indicates the epistemological importance of the astro- 
nomical equipment. 

Talking about the weather could also be an instrument of politics, as the 
making of science was, and is, very political. In the corpus, the weather was 
at times discussed in a text with what can be described as politeness or plain 
style. This structured a discourse which proposed openness towards the vari- 
ous possibilities and active influencing on the observer's circumstances. The 
impact of weather could not be controlled, but it could be understood by sys- 
tematisation;” and as it affected the success of observations and experiments, 
it was used as both an explanation and a weapon by scientists in the politi- 
cal game of making knowledge and acquiring more funding. Royal Society 
funding affected the discourse and the attempts to increase the natural his- 
tory knowledge, and while only a nominal part of the Society's funding came 
from the king, the results and communications of scientific results needed to 
please the funders. John Henry has pointed out that this meant that the funders 
would wish to gain practical applications from the experiments and develop- 
ments of science. As a result, the Society had to be very apologetic and have its 
value clearly propagandised in its attempts to demonstrate the usefulness of 
science to the state.? In order to maximise the attention within the administra- 
tion, the importance and rarity of the observations of the singular 1761 and 
1769 Transits of Venus would have given a reason to report the events with such 
care. It was a good opportunity to emphasise the relevance of the expensive 
astronomical observations, which was brought up wherever and whenever the 
transits were mentioned. 

The final topic featuring the weather was “Travel Narrative; consisting of 
plain or technical descriptions of weather, geography and exploration events. 
The topic was very frequent and steadily present by 696 to 996 throughout the 
period, but was at its largest at the turn of the 1770s. As weather has been 
proved to be an important part of spatial descriptions of 'new' regions, which 
were especially discussed around this time, it is most likely that there was a cor- 
relation between talks of weather and such new areas with promises for impe- 
rial and scientific explorations. 

Having so many obviously different kinds of weather topics at the same 
time means that, simultaneously, many and varied aspects were considered. If 
some discussion has been very small (or unimportant) in its scale, it would not 
be visible in this kind of topic modelling of 1,421 articles. 


Talking about instruments 


Regarding the importance of various disciplines within astronomy, before 1758 
and between the transits, there was according to the statistical results more 
room for showing interest in chemistry, whereas at the end of the 25-year 
period, the data indicates a turn towards widening discussions about the use 
of experiments and optical instruments. This was connected to experiments 
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of measurable phenomena such as air pressure, distance and microscopic life 
forms, and was influenced by technological developments (besides telescopes) 
that also led to increased use of three observation instruments in the form 
of the thermometer, the barometer and the hygrometer. 

These were notable topics. The talk about the instruments and their particu- 
lar use each received 6% to 10%, respectively, of the space of the published texts. 
However, the discussion was not fully diverted from astronomical topics, as 
the development of the instruments is linked to weather discourses. In fact, 
as the three new instruments became available to more observers, it was possi- 
ble to collect much larger amounts of data of local and global weather patterns, 
so they could be systematised and understood not just as weather, but as cli- 
mates. The influence of the political interests involved in producing the ‘facts’ of 
far-afield weather should not be underestimated either. Morgan Vanek has sug- 
gested that the actual reason why 18th-century literature is saturated with the 
rhetoric of meteorological science lies in seeing the topic as a prominent and 
productive term in the public debate about Britain’s imperial obligations. She 
claims that 18th-century writers amplified the threat of environmental influ- 
ence to justify a British right to govern all over the world, as with their govern- 
ance the British could improve living conditions in the ‘new’ harsh regions.” 

Finally, the last approach to astronomy is quite different from the others as it 
addresses a rather different mindset. Interestingly, together with the perspec- 
tives on the influences of various circumstances, at the time of the 1769 transit 
there was also the topic ‘Ancient tradition (with 6% of the text mass), which 
linked the new observational data with ancient calculations and beliefs. This 
might indicate the astronomers’ certainty of the coming success in observing 
the second transit and their enthusiasm in comparing their innovations with 
other foundational theories. This could also discuss the many stages in which 
the rational thinking could be influenced when the collection of new knowl- 
edge in 17th- and 18th-century Europe was largely a cooperation among vari- 
ous specialised actors, where some collected field data and others depicted and 
analysed it. According to Francis Bacon, the different investigators (observers, 
experimenters and theoreticians) were in this required to be on guard against 
‘the idols? the profound and sometimes erroneous ideas created in ones mind, 
as well as the defective sensations of any particular individual. Indeed, it 
sounds like a difficult business to create reliable knowledge navigating the cir- 
cumstances of observation, various investigators and possible mindsets. 


Conclusion: Nine Ways of Talking about Astronomy 


In summarising the critical analysis of the topic modelling results, I suggest 
that during the third quarter of the 18th century Philosophical Transactions 
communicated astronomical topics in at least nine different ways. The actual 
events surrounding the two significant Transits of Venus and other aspects of 
astronomy naturally dominated the discussions in the years 1758-1762 and 
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1768-1772 and, connected to this, the periodical contained 24 topics that were 
common enough to be meaningful in the text-mining results. I located five 
ways of talking about astronomy that existed as temporal trends, emphasised 
only during the transit of 1761 or 1769. 

The first two ways were especially common around the 1761 transit. These 
sub-themes explored the meanings given to the systematisation of data: they 
denoted the practical ways of describing raw data and the systematisation 
of results (and in fact ended up publishing the long lists of various observa- 
tion results), or attempted to reveal a system of the universe by theorising the 
heavenly bodies. This reveals the emerging new natural philosophical ways of 
thinking about the universe. 

The next three ways of talking in astronomical topics were particularly 
common around the 1769 transit. In the 1770s, both the instruments and cir- 
cumstances of experimentation seem to have been included in a wide way of 
talking about observation, such as how the surroundings and weather can influ- 
ence the observing process. The third way of talking then points to the lived 
experience of the field practices of astronomical observation. The fourth astro- 
nomical discourse addressed weather and geographical locations as evidence for 
the technological advances and greatness of the empire that used them despite the 
environment and conditions. The material conditions offered a strong rhetori- 
cal explanation to success and failure: honouring and complaining was linked 
to collaboration between the makers and users of the materials and also to 
accepted rules of polite communication. These two ways oftalking also connect 
to Steven Shapin’s research on the role of actorship in early scientific practice 
and my data suggests that it would be relevant to further explore the language 
and style of how one communicated ones research in early modern scientific 
networks. The fifth discourse linked events to their tradition and could be seen 
to guard the human senses against erroneous input. 

The above five ways of talking were temporal trends that existed only dur- 
ing a part of the research period in question. There were also four continu- 
ous sub-themes that were observed in the data. According to the statistical 
appearance of topics, these four discourses were dynamic and their perspec- 
tive might have changed over time, but nevertheless they continued in some 
notable form. These aspects were pointed out first and foremost via the digital 
method, although in retrospect it is easier to see and contextualise them as a 
part of the data. 

The search for reliable means of observation was another theme that was char- 
acteristic in the entire period around the two Transits of Venus. The sixth astro- 
nomical way of talking concerned the materiality of instruments, their use, and 
the practice of experiments. The processes of observing and measuring would in 
this discourse be linked to trusting the results made through material objects, 
including the instruments, senses and the human body. 

The seventh discourse also concerned the search for reliability and indicates 
that there continued to be some difference in the ways of making arguments 
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through mathematical or physical causal reasoning. While generally the quest 
between the causal and mathematical evidence for new knowledge is visible 
in many topics, the nuances or sub-themes appeared in temporal turns. For 
example, the topic ‘Rules of the skies’ lessened towards the end of the studied 
period, whereas experimentality seems to have become more common. In the 
two topics that concerned the measuring of distance, the discourse was split 
into two subsequent practices, with a different focus on the means of obtaining 
reliable measurements. In other words, this study revealed a temporal variabil- 
ity in what means of observation were seen as reliable. 

The eighth perspective suggests that the Royal Society’s interaction with its 
public in the form ofits periodical typically proposed some flexibility and open- 
ness for developing knowledge, primarily through further experiments. The 
openness to probabilities indicates a structurally designed outlook on knowl- 
edge and aligns with the Society’s values that everyone should themselves test 
the reliability of accepted truths through using thorough experiments. In 1758- 
1762, the astronomical topics tackled reliability issues with algebra and by sys- 
tematising of results: the way to do this was through collecting large amounts 
of data and systematising them, which according to the results was a popular 
method regarding the first transit. At the end of the decade, in 1768-1773, the 
process of measuring and observing seems to have been more cautious in that 
is was giving more attention to probabilities, which indicates easier accept- 
ance towards new open-ended scientific hypotheses and continuously ongoing 
experimentation. During the second transit, the tables and the raw observation 
data had still been published, but now with less importance than before, as the 
data lists are nowhere near as visible when compared to earlier observations. 
This means that the mechanical approach to systematise raw data into thematic 
lists was no longer a remarkable approach. Instead, the natural laws and math- 
ematics were emerging as a more reliable explanation for proving the validity 
of the experiments. 

And, finally, the ninth way reveals the publication of the scientific reports 
to have been entangled with accepted social manners. It illuminates the impor- 
tance of the language in communicating about one’s research. 

Regarding the various individual topics, eight were discussed continuously 
across the whole period and only one of these (the ‘Distance measuring process’ 
topic) concerned astronomy. These topics displayed more general talk about 
the making of scientific knowledge through collecting and observing in the 
form of various ‘plain’ descriptions: the travel location (‘Travel narrative’), 
the events and collected specimens (‘Events, history, specimens’), events in a 
polite style (Events supposing politeness’), ‘new’ species (‘Species’) and obser- 
vation reports (‘Reports’), the last two accounting for up to 5% of the temporal 
change, although usually being less. The descriptions of these ‘standard’ obser- 
vations continued to be presented in mostly stable amounts throughout the 
period. Among the continuous topics, the relationally biggest temporal change 
from a large amount (6% or 8%) to virtually nothing took place with regard to 
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the most specific topics (relating to the states of substances and environmental 
circumstances). These findings about the ways of talking are also supported by 
the findings of Teich’s linguistics team, who noticed, interestingly, that some 
of the major changes occur for non-thematic topics, as the method brought 
out ‘the hows’ of the discourses. However, he non-thematic ‘hows; by which 
I mean the values and themes inside the discussions, are the most interest- 
ing results. These ‘hows’ describe the ways of doing or speaking (for example, 
through polite language or by expressing caution towards various probabilities). 

Many results in this study were ‘exactly what one would expect?" according 
to the research literature. Hence, it was pleasing to begin the analysis with those 
results as they seemed to confirm that the topic modelling had been carried out 
correctly. While the results in this way often confirm an earlier hypothesis, the 
real value of topic modelling is, however, in generating new ways of examin- 
ing our materials, to deform them. I see this as an opportunity to remove all 
the expected details and to reveal what surprising new findings are left. It is in 
this way that this study has shown how a well-known but manually too large 
research material can present different historical insights when interrogated 
through computational distant readings of digitised sources, thus illuminat- 
ing the underused methodological opportunities for historians. The power 
of topic modelling really emerges when we examine change of the thematic 
trends across the entire text collection. This means that the temporal changes 
become visible as the topics are listed at intervals in a longer period.? As the 
case of the talk of ‘distance’ demonstrated, the comparison of two topics of 
the same theme resulted in noticing both a general discourse and located par- 
ticular themes and also different ways to talk about them. 

The temporal change of the paradigm was not seen directly in the results 
sheet produced by running the data corpus through MALLET, but became vis- 
ible when interpreting the results through their contextualisation as a part of 
the well-known events within the history of science. As a list of words, com- 
posed statistically by an algorithm which knows nothing about the context of 
the study, the keywords per se are not the conclusion to our research questions. 
Rather, the key is the joint participations of the results and the researchers' 
insights and contextualisation, which is the approach that produces the hidden 
underlying meanings of the texts in Philosophical Transactions. Examining my 
familiar sources with this new method has, in other words, been beneficial to 
help me better understand the big phenomenon of the new emerging practice of 
science, which is a result that is much more important than just creating a gen- 
eral understanding of how scientists were talking in different ways about Venus. 


Appendix 14.1: The Topic Labels and Keywords 


Environmental circumstances: observations side air light difference lower 
years rain observation circumstances acid diameter set table latitude height 
scale electric force highest 
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Heavenly bodies: water sun parallax comet eclipse event it’s wheel earth green- 
wich crystals cos axis chance coins square moon’s nerves orbit lungs 

Astronomical distance: feet limb observations sun foot cos inches velocity star 
measure est distance meteor modes cum electrified mercury black diapason 
wheel 

Ancient tradition: birds venus transit clear blood ancient quadrant spot 
satellite vesuvius highest morning density roman horizon dist rest 
instrument charcoal observatory 

Transit and Latin: longitude light contact weight vero transit clouds inter hoc 
contained servant primarytopic enim tab south city happen sky vapours 
hujus 

Solar system: side quae moon vel stone observation clear degree read fpage 
earth sed height sin solis corpusbuild letter lpage latitude quod 

Rules of the stars: venus ditto clock center moons system veneris vertical solar 
miles column atque wood maskelyne mons suns horizontal egress meridian 
quidem 

Astronomer's weather: ditto air limb wind fluid sun time cloudy amp part 
clock salt june lowest contact bird matter clouds fixed north 

Distance measuring process: distance experiment made experiments glass case 
great appears give proper state means degrees small method observation 
general appearance line places 

Ancient manuscripts: observations inscription greek appears word fig hours 
inscriptions pag urine birch cliff emperor lead spindle steam tumor cells 
things father 

Telescope observations: sun's parallax apparent cape telescope pro paris error 
wire power west etiam rev minutes issn etruscan rest centers april seconds 

Travel narrative: made equal water observed body called half surface good 
kind sur place weather fpage received inches force white ground plate 

Events, history, specimens: water large likewise issn days manner strong thing 
william inches red applied increase place easily larger numbers discovered 
cut history 

Events supposing politeness: time found great day letter part parts matter 
title amp;c bodies proportion difference present mentioned till vol suppose 
power fire 

Species: number quantity small fig air philosophical heat sun feet appeared 
considerable fla iron species corpusbuild Ipage sea true end long 

Reports: part point amp;c greater manner parts sir put end primarytopic years 
london left observed june substance observe order lightning animal 

Hygrometer measurements French: temperature qui une proportion inch 
conductor form lgs hygrometer mountains luc's cette nous anemones deux 
open general aug sides temperatures 

Thermometer, electricity and effects: heat des thermometer degree electric- 
ity column surface series instrument effect inches heights birds electrical 
torpedo boiling hill columns top common 
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Barometer measurements French: air les quicksilver barometer height tube 
experiments dans hath pour metal sur point fish ball density expansion 
shock ces par 

States of substances with jargon: motion colour common form earth read 
kind sir learned bottom stones mercury spirit james measure england head 
lib miles proved 


Notes 


! Moxham 2016: 469. 

? Byrch 1761; Dunn 1761, The L&P V: 102-103. 

? See, e.g., the correspondence regarding the transport and arrival of the 
instruments by Maskelyne in 1760. 

See Valle 2006; McDougall- Waters, Moxham & Fyfe 2015; Moxham & Fyfe 
2018. 

Kermes, Degaetano, Khamis, Knappen & Teich. 2016a. 

I wish to express my gratitude to them for generously allowing me to use 
the data corpus that they had pre-processed in their project ‘Information 
Density and Scientific Literacy in English: Synchronic and Diachronic 
Perspectives of the Collaborative Research Centre SFB 1102’ 

? Kermes, Degaetano, Khamis, Knappen & Teich 2016b: 1929. 

Shawn, Weingart & Milligan 2013. 

Fankhauser, Knappen & Teich 2016: 498. 

Their project has already been finished and I have not collaborated in the 
research, but I was later permitted to use their open data. On another view 
upon collaboration, Matthew Kirschenbaum described the cooperation as 
also being ‘a social undertaking. It harbors networks of people who have 
been working together, sharing research, arguing, competing, and collabo- 
rating for many years. See Kirschenbaum 2016. 

1 [n addition to the 20 discussed topics (see Appendix 14.1), there were 
five topics (making 25 topics in total) that were unrelated to astronomical 
issues, and were therefore excluded from this study. The excluded six topics 
focused on medical knowledge, nutrition, mortality rates, paleontological 
finds and geographical events such as earthquakes. 

In this topic occurs the keywords fpage, corpusbuild and Ipage, which derive 
from terms used for the metadata included in the JStor text files and not 
from the original articles. This error originates in the pre-processing of 
the data: had it been noticed earlier in the research process, the corruption 
of the topic could have been avoided by including these terms in the stop 
word list. The topic has, however, been accepted as a part of the analysis in 
its partly erroneous form, as it was not considered to skew it considerably 
based on the reason that the keywords did not appear in the earliest part of 
the topic but as the 10th, 16th and 18th in the order of appearance. In topic 
modelling with MALLET, the earliest keywords have more significance in 
the analyses of the topics meaning than the later ones in the list. 
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? Maskelyne 1761. 

^ My depiction was guided by a similar analysis as in Mohr & Bogdanov 
2013: 554. 

1 Henry 2017: 168. 

16 See Çimen 2018. 

7 Topic "Transit and Latin’ uses keywords like ‘servant, which indicates polite 
letter writing and, in the case of the Royal Society, very probably a set of 
observations sent to them. 

'8 Fankhauser, Knappen & Teich 2016: 498. 

9 Golinski 2017: 181. 

? Jardine 1999: 141. 

^ Halley 1716: 246; Leverington 2003: 143. 

? Gascoigne 2000. 

23 Maskelyne 1761: 562. 

* Turner & Turner 1998: 30. 

^ Maskelyne 1761: 563. 

% Galison & Daston 2008: 316. 

? Reid 2008: 172. 

*8 Schaffer 2011: 708. 

? Ibid.: 709-710. 

? Maskelyne 1761: 560. 

1 Wales 1777: 73. 

? Hooke 1663; cf. Daston 2015. 

* Henry 2017: 174. 


3 Sargent 2012, 83. 

?* Fankhauser, Knappen & Teich 2016: 498. 
? Blevins 2010. 

** Shawn, Milligan & Weingart 2013. 

° Blevins 2010. 
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CHAPTER 15 


The Many Themes of Humanism 


Topic Modelling Humanism Discourse in Early 
19th-Century German-Language Press 


Heidi Hakkarainen and Zuhair Iftikhar 


Introduction 


Topic modelling is often described as a text-mining tool for conducting a study 
of hidden semantic structures of a text or a text corpus by extracting topics 
from a document or a collection of documents.’ Yet, instead of one singular 
method, there are various tools for topic modelling that can be utilised for 
historical research. Dynamic topic models, for example, are often constructed 
temporally year by year, which makes it possible to track and analyse the ways 
in which topics change over time.’ This chapter provides a case example on 
topic modelling historical primary sources. We are using two tools to carry out 
topic modelling, MALLET and Dynamic Topic Model (DTM), in one dataset, 
containing texts from the early 19th-century German-language press which 
have been subjected to optical character recognition (OCR). All of these texts 
were discussing humanism, which was a newly emerging concept before mid- 
century, gaining various meanings in the public discourse before, during and 
after the 1848-1849 revolutions. Yet, these multiple themes and early inter- 
pretations of humanism in the press have been previously under-studied. By 
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analysing the evolution of the topics between 1829 and 1850, this chapter aims 
to shed light on the change of the discourse surrounding humanism in the early 
19th-century German-speaking Europe. 

The concept of humanism (Humanismus) was first coined by Friedrich 
Immanuel Niethammer (1766-1848) in German-speaking Europe in 1808.2 
The concept was originally used in the pedagogical debate concerning educa- 
tion, especially in the Gymnasium. This pedagogical debate between humanist 
and philanthropist (realist) education was related to 19th-century educational 
reforms and especially to the school reform in Bavaria, which preceded the 
Prussian school reform between 1809 and 1819.* However, in addition to these 
pedagogical debates, the concept of humanism spread more widely in the 1830s 
and 1840s, and in this gained new meanings and interpretations. 

However, as the previous studies have focused on the early 19th-century 
pedagogical debates, this wider dissemination and popularisation of the new 
concept in the printed press has not been under an extensive close study? There 
exist a large number of printed publications discussing humanism already in 
the first part of the 19th century, which makes an inquiry on press debates a 
challenging task for historians.* In order to tackle this challenge of the vast 
size of potential source material, this chapter uses the quantitative method of 
computer-based ‘topic modelling’ to assist the qualitative analysis. 

By topic modelling a set containing almost 100 key texts from the years 
between 1829 and 1850, this chapter recovers several of those multiple dis- 
courses connected to humanism before, during and after the outburst of the 
1848-1849 revolutions. By combining and comparing topic modelling with 
MALLET with Dynamic Topic Modelling (DTM), this chapter seeks to map 
and analyse what kinds of topics were related to humanism before 1850 and 
how these topics changed and evolved over time. During 21 years, human- 
ism appeared in various contexts from education to philosophy, religion and 
politics. Where the MALLET, as the most well-established topic modelling 
tool within the field of digital history,’ is used in detecting the most promi- 
nent themes in the discussion on humanism, DTM makes available a finer look 
into the topics at the temporal level and, in this case study, provides a new 
kind of insight into the growing importance of temporality within the German- 
language humanism discourse between the early 1800s and the mid-century.’ 

In contrast to temporally ambitious research on huge corpuses, this chapter 
focuses on a rather small text corpus, which allows more exploration of the 
possibilities of cross-reading the material with methods of close and distant 
reading. This study of the discussion surrounding humanism before 1850 
thus provides a reasonably manageable but rich investigation of some of the 
ways in which newspapers and periodicals addressed topical issues and trans- 
ferred concepts and new ideas across political borders within the lands of the 
German Confederation. 

At the same time, we seek to explore what kinds of methodological ben- 
efits and risks are involved in the topic modelling of historical sources. The 
technique of topic modelling decides what constitutes a topic on an algorithm 
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that creates a statistical model of word clusters. It is thus not a fixed schema, but 
a variable probabilistic model that should also be treated as such. We will dem- 
onstrate how various forms of cleaning and filtering of the data can have drastic 
results on the output of the topic model. We also present and compare outputs 
from different methods of topic modelling, using the MALLET application and 
DTM, and address various methodological concerns related to topic modelling. 


Topic Modelling with MALLET 


The first essential step in describing the 19th-century German-language press 
discourse on humanism was to identify its various individual themes or top- 
ics using the quantitative method of topic modelling. Topic modelling has its 
roots in information retrieval, natural language processing and machine learn- 
ing. This probabilistic tool has attracted attention among historians, because 
it enables detecting underlying thematic structures behind a large corpus of 
documents, as well as surprising connections between individual texts. A topic 
comprises a distribution of words. A single document is assumed to contain 
words about multiple topics within the whole dataset. Each word is drawn from 
those dataset topics. The study used two topic modelling tools where the first 
is called MALLET (Machine Learning for Language Toolkit, version 2.0.8.), 
which is an open source Java-based software package for natural language pro- 
cessing using Latent Dirichlet Allocation technique (LDA).? 

Before using MALLET and in pre-preparation, the machined encoded OCR 
German-language press texts were cleaned and corrected manually (especially 
the recurring problem with some Unicode characters). In some cases, this 
included shortening the texts by excluding clearly irrelevant sections. 

The model was then made with the 'optimise-interval' command, which sets 
each topic’s probabilistic Dirichlet parameter that indicates the topic’s propor- 
tion in the whole dataset, and gives a better fit to the data by allowing some 
topics to be more prominent. In addition, the number of topics to be identified 
by MALLET is set beforehand as there is no ‘natural’ number of topics in a cor- 
pus, but this part requires manual evaluation and iteration by the researchers.'? 
Both MALLET and the DTM tool only mechanically detect topics and assign 
them numeric values, whereas identifying and naming the topics (that is, deter- 
mining and labelling the thematic categories found by the machine reading) is 
something the human researcher has to carry out using manual reading. And 
this is an act of interpretation. 


Topic Modelling with DTM 


Within probabilistic topic modelling, LDA is a frequently used technique and 
its MALLET implementation has traditionally been the most popular tool to 
analyse historical corpuses. Ever since topic modelling was first introduced in 
the early 2000s, there have been new extensions that help to model temporal 
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relationships. One shortcoming of the LDA method is that it assumes that the 
order of documents is irrelevant. But if we — as historians are often prone to 
- want to discover the evolution of topics over time, then we have to take the 
time sequence into account. DTM attempts to overcome this shortcoming and 
captures the dynamics of how topics emerge and change over time.” 

DTM is designed to explicitly model the ways in which topics evolve over 
time and to give qualitative insights into the changing composition of the 
source material. However, it is not the only such tool available and it has also 
been subjected to critique for penalising large changes from year to year." 
The DTM is a probabilistic time series model, which is designed to track and 
analyse the ways in which latent topics change over time within a large set of 
documents. For example, David M. Blei and John D. Lafferty demonstrated 
the functioning of DTM by investigating topics of the journal Science between 
1880 and 2000." Our case study is based on a small source corpus, which, as we 
will soon see, was one important factor in the output from the dynamic topic 
modelling. Because of the small size of the dataset, cleaning and filtering the 
data had a major impact on DTM's output. The more historical sources were 
pre-processed, the more stable the model became. 

As mentioned above, few but not all text files were reviewed for common mis- 
takes and in a few instances some mistakes were manually corrected. Python's 
Natural Language Toolkit (nltk) library was used for the pre-processing and 
filtering ofthe texts. Prior to passing on the text data to DTM tools, the text was 
processed using the following pre-processing pipeline: 


1. Punctuation and numbers removal. Punctuation characters within and 
around all the words were deleted and all the other characters except 
alphabetic characters were removed. 

2. Stop words removal. This is a common operation when processing text 
in any domain. The list of German stop words was initially taken from the 
nltk library and MALLET tool. This list was extended by reviewing the 
texts and some words deemed to be useless were then added to the list. 
Any words in the stop words list were removed in pre-processing. 

3. Stemming and lemmatising. Stemming is the process by which a word 
is reduced to its base form and all the inflectional forms of a word are 
reduced to a single base stem. Using language dictionaries, lemmatisation 
converts a word to its base lemma. This is the word from which all the 
inflectional forms are derived. The base stem is then used by the lemma- 
tisers to find the base lemma, which is then kept in the text. 

4. Classification. The words were then classified into different parts-of- 
speech with the goal being to keep various nouns and verbs identified 
in the input texts. Words which belong to other parts-of-speech were 
removed. 

5. Rare words. As a final step, the words which appear only once in the 
whole input corpus were also removed. 
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We then used Gensim (Python library) to run the DTM tool. After creating 
various outputs of models with 5 to 20 topics, as for the previous analysis using 
MALLET, we decided to limit the number of topics to 10. Like MALLET, DTM 
also gives keywords (that is, a cluster of words relevant to the topic), which help 
to identify the topic. 


Source Material 


The source material used in this study is a sub-dataset from the digital cor- 
pus Austrian Newspapers Online (ANNO), provided by the Austrian National 
Library (at http://anno.onb.ac.at). The digital ANNO collection contains 
around 20 million pages of German-speaking newspapers and periodicals 
that are available for full text searches." The Austrian National Library at their 
ANNO-portal provides an OCR tool for machine encoded optically recognised 
text which, although not totally reliable and contains errors, can be used for the 
digital analysis. 

According to the full text search engine of the ANNO portal, the word 
Humanismus (humanism) was mentioned 326 times in the press between 1808 
and 1850." Because the old German Fraktur typeface is challenging for OCR, 
the results should not be interpreted as entirely reliable, but as giving an indica- 
tion of the scale of use, how much this word was circulated in the press. In some 
texts, humanism appeared only once in passing, while in others it was men- 
tioned several times and discussed explicitly. Based on their relevance, length 
and readability, we have selected 95 key texts for topic modelling analysis (see 
Appendix 15.1). These texts include book reviews, articles, news, feuilleton 
writings and political reports, while reprints, short notices, adverts and obitu- 
aries have been excluded. 

Figure 15.1 illustrates the publishing centres and various publications that 
make up the dataset. The graph is made with the Gephi visualisation application 
and it aims to depict the source material in a visually conceivable way. Moreo- 
ver, Gephi is a frequently used software tool for network analysis, because it 
enables the portrayal and analysis of relationships or interaction between per- 
sons, entities and objects, such as geographical places or publications.'* The 
objects (nodes) and their relationships (edges) can be presented in many dif- 
ferent ways. In this case, the layout was made manually instead of choosing 
one of the most popular layout algorithms such as Force Atlas or Fruchterman 
Reingold. The nodes and edges tables were imported to Gephi as CSV files and 
in the edges table the connection between a publication and its place of pub- 
lishing gained ‘weight’ in accordance to the amount of texts discussing human- 
ism in that particular publication during the period between 1829 and 1850. 
The more humanism was mentioned, the thicker the line between a newspaper 
or a magazine and the city in which it was published. Accordingly, the strength 
of connections indicates which were the most important publishing centres 
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Figure 15.1: Network diagram of publications and publishing centres of the 
dataset. Source: Authors. 


and highlights publications that most extensively dealt with humanism in this 
dataset. Even though the ANNO source corpus is partial and dominated by 
Austrian newspapers and magazines, Figure 15.1 shows that the early 19th- 
century discussion on humanism surpassed political borders within the Ger- 
man Confederation spreading in the area of fragmented German lands and 
German-speaking parts of the Habsburg Empire. Vienna, Leipzig and Berlin 
were the most important publishing centres and the literary journals Blätter 
fiir literarische Unterhaltung and Literarische Zeitung dealt with the topic most 
extensively, although the publications dealing with humanism ranged from 
daily newspapers to religious magazines and satirical journals. 


German Humanism According to MALLET Topic Modelling 


Initial details about MALLET are summarised in the previous section. Below 
are the eight topics in order of prevalence with their top words as discovered 
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by MALLET when asked to determine the 10 most prevalent topics and as 
labelled (education, reformation, etc.) by us. The number of topics was chosen 
after experimenting with different kinds of models and 10 topics were cho- 
sen as a best way for modelling the source corpus, which was small and frag- 
mented. Topic modelling usually involves filtering away so-called stop words, 
non-informative frequently appearing words such as articles, particles and 
pronouns. However, especially when it comes to creating a model with a small 
number of topics, pre-processing the data has a danger of compromising the 
results as the researcher makes decisions on removal of stop words according to 
her or his pre-understanding, thus projecting into the data certain presupposi- 
tions regarding what is important in the corpus." Accordingly, in this model, 
no pre-filtering of stop words was carried out before the analysis, but two topics 
that contained only stop words were filtered out after creating the model. See 
Appendix 15.2 for the whole model. 


Religion: fich menschen gott religion find juden zukunft religidsen gottes 
humanismus mensch christenthum christliche niht darum demokratie 
humanitát christlichen christen theorie 

Education: erziehung schulen lehrer sprache bildung seyn gymnasien unter- 
richt realismus sprachen realschulen schüler jugend individuum wissen- 
schaften anstalten schrift realschule 

Revolution: wurde freiheit volk stadt wurden berlin revolution kammer bald 
volkes vólker waren heute republik straßen preußen fast macht bürgerwehr 
haufen 

Philosophy: fich philosophie ruge find nationalismus princip paris jahrbücher 
literatur preußen geschrieben briefe socialismus anfichten brief patriotis- 
mus rage artikel principien staatsanwalt 

Reformation: kirche fich universitáten luther reformation staat lehre staats 
reform gemeinden schottischen glaubens bloß kirchen verfassung staate 
theologen lehrer wissenschaft hervor 

Death penalty: todesstrafe sei abg verbrechen strafe habe amendement antrag 
könne man dieß gesetze redner verbrecher abgeschafft jury wolle abschaf- 
fung angenommen gegen 

Press debate: dafs christlichen philologie gegner muss zeitung liberalismus 
sache sinne bedeutung gesinnung jedenfalls artikel presse giebt philologen 
meinung klassischenmonarchischen christliche 

Social issues: fie the fich hamburg euch gesehen zigeuner habt bey ift diefe 
wiffen feine sprachen stadt armen glück schüler jhr their 


The output from MALLET provides eight topics with different keywords. In 
the ‘Education’ topic, words like Erziehung (education/upbringing), Schu- 
len (schools), Lehrer (teacher) and Sprache (language) are clustered together 
with such difficult-to-translate German concepts like Bildung and Gymnasien, 
which indicate that this topic is related to the educational debates about the 
role of humanism in the modern schooling system that were a very impor- 
tant issue in the era of comprehensive school reforms. After all, the concept of 
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Humanismus (humanism) was, as mentioned above, first coined as a pedagogi- 
cal concept, fostering classical education and the study of classical languages." 
The ‘Reformation’ topic, on the other hand, contains words like Kirche (church), 
Universitüten (universities), Luther (Luther) and Reformation (reformation), 
which give reason to believe that this topic deals with humanism historically in 
relation to Martin Luther and the reformation era. 

However, in addition to these highly obvious and clear results, there are also 
topical word clusters which show a completely different kind of interpretation 
of humanism. The topic ‘Philosophy, for instance, contains words like Philoso- 
phie (philosophy), Ruge (Ruge), Nationalismus (nationalism), Princip (prin- 
ciple) and Paris (Paris). All of these words are connected to the philosopher 
Arnold Ruge (1802-1880), who was also a political writer, associated with the 
Young Hegelians and Karl Marx, and known for his radical ideas that religion 
should be separated from politics and intellectual thinking. Ruge was one of 
the main figures who in the 1840s introduced a new interpretation of human- 
ism as a political concept and his ideas were highly debated in the press.” For 
Ruge, humanism meant political emancipation from the old ancien régime. 
He incorporated humanism in democratic-republican ideology, which com- 
bined social critique with critique towards religion and growing nationalism. 
Humanism meant political, religious and social freedom, which was universal 
for the whole of mankind and superseded national borders. Accordingly, in 
Geschichtliche Grundbegriffe, Ruges interpretation of humanism is called kos- 
mopolitischer Humanismus (cosmopolitan humanism).” 

This radical new political meaning of the concept of humanism is also vis- 
ible in topics that dealt with social problems and political issues like the death 
sentence and the 1848-1849 revolution. For example, the topic labelled ‘Social 
issues contains keywords like Zigouner (gypsies), Armen (the poor), Stadt (city) 
and Glück (happiness). Again, the topic ‘Death penalty’ is clustering together 
words like Todesstrafe (death penalty), Verbrechen (felony), Strafe (punish- 
ment) and Amendement (amendment), which are all related to the debates 
around abolishment of the death penalty, which was a topical issue especially 
in Austria around 1849. Moreover, topic modelling of the dataset reveals a topic 
relating explicitly to the European revolutions in 1848-1849. This topic labelled 
with the title ‘Revolution’ contains the following keywords: wurde (came), Frei- 
heit (freedom), Volk (people), Stadt (city), Berlin (Berlin), Revolution (revolu- 
tion), bald (soon), heute (today), Republik (republic) Straßen (streets), Macht 
(power), Bürgerwehr (militia) and Haufen (pile). This topic, especially, indi- 
cates how humanism became a political concept in the 1840s when both early 
socialists and liberals adopted humanism in their political language as they 
demanded political emancipation from the old regime.” 

This result demonstrates the diversity of the meanings given to humanism in 
the early 19th-century press. In addition to educational debates, humanism 
also appeared in the discussions surrounding social and moral issues, law and 
politics. In fact, the extremely diverse topics of humanism indicate a pervasive 
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reorganising of ideas related to the human being and his or her place in the uni- 
verse in the post-Napoleonic era, in which the liberal bourgeoisie was gaining a 
new foothold in society at the same time that the Church and absolutist power 
were challenged in the aftermath of the French Revolution. This transformative 
era created new interpretations on how politics, religion, education and phil- 
osophical thinking should be organised in modern secularising society, and, 
despite the practices of censorship especially in Prussia and Austria,” the press 
played a major role in circulating these ideas among a growing readership. 

Consequently, the vast processes of secularisation and modernisation help 
us to understand why the ‘Religion’ topic was the most dominant theme in the 
early 19th-century press discussion on humanism. This most prevalent topic 
contains many interesting keywords indicating how discourses surrounding 
religion, morality and politics were actually significantly entangled in the early 
19th-century discussion on humanism. The clustering of words like Menschen 
(human being), Gott (god), Religion (religion), Juden (Jews), Zukunft (future), 
Humanismus (humanism), Christenthum (Christianity), Demokratie (democ- 
racy), Humanität (humanity) and Theorie (theory) is a good example of the 
interpretative challenges that take place when identifying and labelling topics 
that are not cohesive but multifaceted and extremely complex. We will examine 
the ‘Religion’ topic closer below using DTM. But first, we will locate which 
years this topic emerged most dominantly between 1829 and 1850. 

Following the task of identifying topics, it is vital to also explore them and 
their meanings in the historical context in which they came to life. In other 
words, it is essential to acknowledge the temporality of the topics and study 
them from in a dynamic historical perspective. For example, the volume 
ofthe press was very different in 1829 and in 1850. Furthermore, the new Young 
Hegelian philosophical ideas and growing interest in social issues was part of 
the intellectual and social landscape of the 1840s and it goes without saying that 
the outbreak of the revolutions in 1848 was clearly a major historical event that 
impacted on the public discourse surrounding humanism. 

Without additional programming, MALLET does not present the topics in 
relation to time. Yet, it is possible to inspect the dynamic temporal aspect of 
the topics by organising the dataset chronologically.” Accordingly, the files 
of the dataset were numbered from the oldest, in this case 1829, to the young- 
est, here 1850. This means that it is now possible to study how topics emerged 
and changed over time (Figure 15.2). In Figure 15.2, the two stop word topics 
are filtered out, presenting only the eight relevant topics. 

We can now see the thematic trends and how the topic patterns change 
over time. The figure above indicates that before 1840 ‘Education’ and ‘Social 
issues’ were important topics in relation to humanism, but in 1848 the topic 
‘Revolution’ became dominant. In 1849, it was replaced as the leading topic 
by ‘Death penalty; with ‘Religion’ following in prevalence. The ‘Religion’ topic 
gained importance especially immediately after the revolution, which could 
indicate a reaction to the turbulence and violence in 1848-1849. Yet, despite 
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Figure 15.2: Annual allocation of topics. Source: Authors. 


the chronological aspect, MALLET’s results are always compressed and cannot 
give any further insight into the dynamics within the topics that have been dis- 
covered. In the next section, we will further analyse how dynamic topic mod- 
eling (DTM) can make it possible to gain insight about the dynamics within 
one singular topic. 


Discovering Temporalisation of 
German Humanism with DTM 


Preliminary details about DTM and text pre-processing details are mentioned 
above. Ihe cleaning of the data had a major impact on the output ofthe model. 
At first, the results were very similar to the MALLET analysis and many topics 
seen before persisted. For example, humanism continued to emerge in relation 
to the topic of ‘Religion’ [‘menschen; 'humanismus, ‘humanitat, ‘zukunft, ‘ste’ 
‘stch, ‘religion, ‘wahrheit, ‘bloß demokratie, ‘christenthum; ‘recht, 'gegenwart; 
fich, ‘wohl]]. Also, the topics ‘Education’ ['bildung, ‘mehr; 'erziehung, ‘zeit, 
‘lehrer, ‘jugend, ‘realschulen, ‘find, ‘wissenschaft, ‘sache, ‘neuen, ‘immer, gym- 
nasien, ‘zweck, ‘mittel’], Death penalty’ ['sei; ‘verbrechen, ‘abg, könne) ‘redner’, 
‘schon, ‘antrag, amendement, ‘angenommen; ‘staat’ 'dief?; ‘abgeschafft, ‘ab, ‘be; 


© 


‘abschaffung’] and ‘Revolution’ [‘wurde; ‘macht, vólker, volk, geschichte, ‘bald, 
“freiheit, ‘berlin; ‘volkes; ‘revolution, ‘werk, ‘wurden; ‘je; regierung, ‘tage’] were 
remarkably similar. However, there were also changes. Social issues and debates 
around Ruge’s interpretation of humanism were more in the background and 
there was more than one category relating to religion and education. 
Furthermore, with DTM, we had more fine-tuned results as the source 
corpus was divided into different time frames and keywords were arranged 


year by year. As the keywords appeared in a list from most important to least 
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1829 "menschen, ”humanismus, ”humanität; "zukunft 
1830 "menschen, *humanismus, ”humanität; "zukunft? 
1831 "menschen, ”humanismus, ”humanität; "zukunft 
1832 "menschen, ”humanismus, ”humanität; "zukunft? 
1833 "menschen, *humanismus, ”humanität; "zukunft? 
1834 "menschen, ”humanismus, ”humanität; "zukunft? 
1835 "menschen, *humanismus, ”humanität; "zukunft? 
1836 "menschen, ”humanismus, ”humanität; "zukunft 
1837 "menschen, *humanismus, ”humanität; "zukunft? 
1838 "menschen, *humanismus, "zukunft, ”humanität' 
1839 "menschen; ”humanismus, "zukunft, ”humanität' 
1840 "menschen; *humanismus, "zukunft, ”humanität' 
1841 "menschen, "zukunft, ”humanismus, ”humanität' 
1842 "menschen, "zukunft, ”humanität; *humanismus’ 
1843 "menschen, "zukunft, ”humanität; *humanismus’ 
1844 "menschen, "zukunft, ”humanität; *humanismus’ 
1845 "menschen, "zukunft, ”humanität; *humanismus’ 
1846 "menschen, "zukunft, ”humanität; *humanismus’ 
1847 zukunft "menschen ”humanität; *humanismus’ 
1848 "zukunft "menschen ”humanismus, ”humanität' 
1849 "zukunft "menschen ”humanismus, ”humanität' 
1850 "zukunft, "menschen ”humanismus, ”humanität' 


Figure 15.3: Output from the DTM before data cleaning, including the four 
first keywords. Source: Authors. 


important, it was possible to detect the ways in which the order of these key- 
words changed within one singular topic. The most striking new discovery 
with DTM was that there were cases in which words with temporal meaning 
such as Zeit (time) or Zukunft (future) became increasingly important towards 
mid-century. This discovery resonates strongly with the conceptual historian 
Reinhart Koselleck’s famous argument that the early 19th century was a Sat- 
telzeit, a period in which the notion of time changed radically and concepts 
became increasingly abstract and more future-oriented. Koselleck suggested 
that as modern concepts became more entangled with historical time, being 
associated increasingly with the past, the present and the future, the phenom- 
ena which previously were seen as static and unchanging became conceived as 
dynamic processes." 

To give an example, in Figure 15.3 we have the four most important words for 
the topic ‘Religion, containing words like Menschen (human being), Humanis- 
mus (humanism), Zukunft (future), Humanität (humanity), Religion (religion), 
Wahrheit (truth), Demokratie (democracy), Christenthum (Christianity), Recht 
(justice) and Gegenwart (present), which are very similar to those words seen 
in the most prevalent Religion topic in the MALLET results. 

However, here the topic seems to be relating more to human beings and 
morality rather than religion. In addition, the meaning of the word Zukunft 
(future) is of special interest here, as its position changes radically between 


270 Digital Histories 


1829 "menschen, gott, ‘religion, "humanismus, zukunft, 
1830 "menschen, gott, ‘religion, "humanismus, zukunft, 
1831 menschen, gott ‘religion, "humanismus, zukunft, 
1832 menschen, gott, ‘religion, "humanismus, zukunft 
1833 menschen, gott, ‘religion, "humanismus, "zukunft, 
1834 menschen, gott, ‘religion, "humanismus, zukunft 
1835 menschen, ‘gott, ‘religion, humanismus, zukunft, 
1836 "menschen, ‘gott, ‘religion, humanismus, ’zukunft, 


» 


1837 menschen, gott ‘religion, ‘humanismus, zukunft, 


» 


1838 "menschen, gott ‘religion, humanismus, zukunft, 
1839 "menschen, gott ‘religion, "humanismus, zukunft, 


» 


1840 "menschen, gott ‘religion, "humanismus, "zukunft, 


1841 "menschen, gott ‘religion, —”humanismus; zukunft, 


> 


1842 "menschen, ‘gott, religion, humanismus, "zukunft, 


1843 "menschen, ‘gott, ‘religion, humanismus, zukunft, 


1844 "menschen, ‘gott, ‘religion, — "zukunft, "humanismus, 
1845 "menschen, gott ‘religion, zukunft, "humanismus, 
1846 "menschen, gott ‘religion, "zukunft, ”humanismus, 
1847 "menschen, gott ‘religion, "zukunft, ”humanismus, 
1848 "menschen, ‘gott, ‘religion, "zukunft, ”humanismus, 
1849 "menschen, gott ‘religion, "zukunft, ”humanismus, 
1850 "menschen, ‘gott, ‘religion, "zukunft, ”humanismus, 


Figure 15.4: Output from the DTM after data cleaning, including the five first 
keywords. Source: Authors. 


1829 and 1850. Figure 15.3 shows the output from the DTM before data clean- 
ing, including the four first keywords. In post-cleaning, the letters ‘ste’ were 
filtered out. 

However, this striking change did not appear in all the outputs, but the more 
we removed stop words and filtered the data for better results, the more stable 
the topic appeared (Figure 15.4). In addition, the word God (Gott), which is 
missing in the first output together with religion (Religion), is now continuously 
the second most important word after human being (Mensch). The information 
about the proposition of each word within the topic indicates that changes were 
so minor that altering the script by removing stop words and removing words 
that appeared only once changed and stabilised the model to the extent that 
changes could no longer be seen in the order of the keywords.” 

Yet, to give another example, the word Zeit (time) became increasingly impor- 
tant in another topic that included keywords such as Wissenschaft (science/ 
knowledge) and Erziehung (education/upbringing). The change is visible both 
before and after filtering stop words. The Dirichlet parameter indicates that the 
weight of the word Zeit did not increase, but the growing importance resulted 
from the fact that the importance of the word Wissenschaft decreased radically 
around 1846.” This was a modest change, but it persisted in the outputs made 
before and after removing the stop words and carrying out other data filtering, 
such as removing words that appeared only once. 
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1829-1850 


Philology 
Church history 


Philosophical tendencies 


Revolution & Distribution of power 


Political debate 


Philosophy of science 


Education of the Jews 


Education 


Religion, Morality & Relationship to God 


Study of languages 


Figure 15.5: Topics detected by the DTM tool after data cleaning. Source: 
Authors. 


In the end, after data cleaning and filtering the historical sources, the DTM 
tool provided a list of the 10 most prevalent topics in the early 19th-century 
press (Figure 15.5). Yet, because of the short timeline and small size of the 
source corpus, the final output provided very static results and only very small 
changes within these topics were able to be discovered. 

However, it is important to bear in mind that the dataset used in this case 
study was small. A larger dataset together with a potentially longer timeline 
would probably make it possible to detect and analyse more drastic changes 
over time. In any case, both of these examples illustrate that topic models are 
first and foremost probabilistic models providing estimates of the most sali- 
ent discourse topics. Semantic changes are related to probabilistic proportional 
changes (in topic word list) and examining the probability distribution param- 
eters (values associated with topic words in the output) is vital for understand- 
ing how these models work in practice. 


Conclusions 


This study has investigated the early 19th-century German press discourse 
on humanism, which has been an under-researched area to date. In this 
chapter, we have modelled the topics of humanism in the early 19th-century 
German-language press with MALLET and DTM. By analysing the evolution 
of the topics between 1829 and 1850, this chapter has explored the change of 
the discourse surrounding humanism in early 19th-century German-speaking 
Europe. Both topic modelling applications detected different topics among 
the text corpus and recognised different semantic categories in the early 19th- 
century German-language source material without any understanding of the 
substance or context of these texts. 
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Topic modelling contains various methods, which can be used for different 
purposes. As we have shown, topic modelling can provide assistance for histor- 
ical research as a tool for analysis and interpretation. In this study, we created 
different topic models of a dataset that was relatively small and could be closely 
read in addition to distant reading. Both MALLET and the DTM tool not only 
enable us to identify thematic categories (that is, topics within the dataset), but 
they also make it possible to trace these topics back to file level. The outputs 
produced detailed results on how each topic appeared in each of the 95 texts 
of the dataset, which makes it possible to trace topics back to the level of indi- 
vidual articles for close reading analysis. If one is especially interested in, say, 
Revolution as a press topic, one could select and read all the news articles and 
other texts in which this topic appeared during the time frame of 1829 to 1850. 
This kind of assistance is invaluable for mapping and assessing sources, which 
is often laborious and time-consuming. 

At the same time, our study also sheds light on the potential benefits and 
risks of topic modelling within historical research. From a methodological per- 
spective, it is important to bear in mind that although topic modelling might 
produce highly compelling results, the analysis of these results demands time, 
skills and caution. One has to remember that results can vary depending upon 
the input topic number, size of the dataset, specific tool used for topic model- 
ling, data cleaning and methods of filtering. Topic modelling provides assis- 
tance for historical research as a tool for analysis and interpretation, but the 
output of a topic modelling process is not a result in itself and needs to be 
studied further for reliable conclusions. Topic modelling results can answer a 
historian's intuitive questions by providing focus and direction to the analysis 
of historical corpuses through traditional methods of historical inquiry, source 
criticism, close reading and contextualisation. Perhaps even more importantly, 
topic modelling has the potential to challenge established patterns of thought 
and underlying presumptions by providing a completely different angle on his- 
torical sources. 


Appendix 15.1: Dataset 
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Appendix 15.2: MALLET Tool 


Topic ID | Topic label Dirichlet | Keywords 
parameter 
0 [EDUCATION] 0,12609 erziehung schulen lehrer sprache 


bildung seyn gymnasien unterricht 
realismus sprachen realschulen schiiler 
jugend individuum wissenschaften 
anstalten schrift realschule 


1 [REFORMATION] | 0,07934 kirche fich universitäten luther 
reformation staat lehre staats reform 
gemeinden schottischen glaubens bloß 
kirchen verfassung staate theologen 
lehrer wissenschaft hervor 


2 [PHILOSOPHY] 0,09784 fich philosophie ruge find nationalismus 
princip paris jahrbücher literatur 
preußen geschrieben briefe socialis- 
mus anfichten brief patriotismus rage 
artikel principien staatsanwalt 


3 [RELIGION] 0,18337 fich menschen gott religion find juden 
zukunft religiósen gottes humanismus 
mensch christenthum christliche niht 
darum demokratie humanitat 
christlichen christen theorie 


4 [SOCIAL ISSUES] | 0,05893 fie the fich hamburg euch gesehen 
zigeuner habt bey ift diefe wiflen feine 
sprachen stadt armen glück schüler 
jhr their 


5 [STOP WORDS] 1,43466 mit aber nur man hat noch diese zeit 
welche haben mehr gegen denn selbst 
uns alle ohne ihm sondern leben 


6 [PRESS DEBATE / | 0,06722 dafs christlichen philologie gegner 
CONTROVERSY muss zeitung liberalismus sache sinne 
bedeutung gesinnung jedenfalls artikel 
presse giebt philologen meinung 
klassischen monarchischen christliche 


vi [LAW / DEATH 0,07172 todesstrafe sei abg verbrechen strafe 

PENALTY] habe amendement antrag kónne 
man dieß gesetze redner verbrecher 
abgeschafft jury wolle abschaffung 
angenommen gegen 


8 [REVOLUTION] 0,11352 wurde freiheit volk stadt wurden 
berlin revolution kammer bald volkes 
völker waren heute republik straßen 
preußen fast macht bürgerwehr haufen 


9 [STOP WORDS] 0,07725 fich ift find feine fein diefe diefer feiner 
fei fondern nnd felbfi ihm nichts 
zwifchen diefem fchon lehre fehr wol 
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Appendix 15.3: DTM Tool 


10 topics found by DTM after data cleaning: 


Philology: philologie alterthums wissenschaft studien zeit bedeutung artikel 
klassischen richtung gymnasien weise zeigt damals gewinnen darauf 

Church history: kirche deutschen humanismus ganz zeit Deutschland 
große ruge macht geschichte staat bald reformation freiheit princip hätte 
jahrhunderts 

Philosophical tendencies: tendenz so deutfchen bey welt bildet herr wißen 
vermógen gerade briefe feuerbach menfchen diese 

Revolution & Distribution of power: schon geht volk bleibt freiheit berlin 
verbrechen hand viele ersten fall davon gut 

Political debate: mittel freien entwickelung liberalismus gemacht monarchis- 
chen indessen ganz bedeutung zeit glaubens weder regierung 

Philosophy of science: geist denen idee lebens welt einzelnen vielmehr leben 
einzelnen philosophie schule recht staat partei 

Education of the Jews: schon juden wenig sinne schule allgemeinen mag 
allerdings beziehung irgend sagen christlichen óffentlichen wenigstens 

Education: bildung zeit erziehung schon humanismus schüler leben lehrer 
immer seyn kraft besonders deutsche ganz wohl allein aufgabe 

Religion, Morality & Relationship to God: got zukunft humanitát menschli- 
chenleben wahre christenthum mensch freiheit demokratie wahrheit 
religiósen sagen gewalt welt politik 

Study of languages: sprache sprachen zeit realschulen gesehen welt bildung 
amburg jugend schulen find gymnasien erfahrung habt werke neuen element 


Notes 


! See further Erling 2014: esp. 58-59, and Jacobi, Atteveldt & Welbers 2015. 

? Blei & Lafferty 2006. 

> The concept of Humanismus was coined in 1808 when Niethammer used 
it in his book Der Streit des Philanthropinismus und Humanismus in der 
Theorie des Erziehungs-Unterrichts unsrer Zeit. However, the tradition of 
German humanism dates back to the 15th and 16th centuries, when the 
ideas of Italian renaissance humanism spread across Europe. Accordingly, 
such concepts as humanitas and studia humanitatis are much older origin, 
dating back to antiquity. 

^ See further Bollenbeck 1994: 142-155. 

? In addition to Georg Bollenbecks book, the most extensive studies dis- 
cussing 19th-century German humanism are by Landfester 1988 and van 
Bommel 2015. 

é The book industry and the press were both growing in volume in the first 
part of the century, expanding even more dramatically as the 19th century 
neared its close. See further Erling & Tatlock 2014. Cf. St. Clair 2004. 
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7 See further Brauer & Fridlund 2013: 159. 

8 Cf. Steinmetz & Freeden 2017: 2, 5. 

? LDA (Latent Dirichlet allocation) was developed by David Blei and oth- 
ers in 2003 and MALLET (MAchine Learning for LanguagE Toolkit) was 
written by Andrew McCallum. For more information, see http://mallet. 
cs.umass.edu/about.php. See also The Programming Historian tutorial on 
MALLET, Graham, Weingart & Milligan 2012. 

10 The model provides as output three different files: topic ‘state’ assigning each 
word in the text to a topic, ‘topic keys’ consisting of the top words for 
each identified topic and the topic ‘composition’ consisting of allocation 
of percentage of every topic in each of the 95 files that were included in 
the analysis. 

1 DTM, Blei & Lafferty 2006. 

? Hall, Jurafsly & Manning 2008: 364. 

3 Blei & Lafferty 2006. 

14 See ANNO webpage. 

15 See further, Hakkarainen 2020: 27-28. 

1€ See further, e.g., Cordell 2015. 

" Cf. Schonfield, Magnusson & Mimno 2017. 

35 See further Bollenbeck 1994: 142-155; van Bommel 2015. 

? See, e.g., Anon. 1846; Anon. 1 & 4 August 1847; Anon. 1848. 

2 Bódeker 1982: 1123-1124. 

?' [bid.:1121-1126. See also Hansson 1999: 77-106. 

? See further Dussel 2011: 25-34; Stóber 2014: 141-142. 

? Cf. Blevins 2010. 

^ Koselleck 1985. See also Steinmetz & Freeden 2017: 2, 5. 

^ However, even after filtering there was a minor increase in the percentage. 
In 1829, the proportional number for the word ‘Zukunft’ was 0.010673; in 
1850, it was 0.016537. 

% In 1829, the proportional number for the word ‘Wissenschaft was 0.01164- 
2238616473554; in 1850, it was 0.006096759758556382. 
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CHAPTER l6 


Manuscripts, Qualitative Analysis and 
Features on Vectors 


An Attempt for a Synthesis of Conventional and 
Computational Methods in the Attribution of Late 
Medieval Anti-Heretical Treatises 


Reima Välimäki, Aleksi Vesanto, Anni Hella, 
Adam Poznanski and Filip Ginter 


Introduction 


Authorship and originality were tricky things in medieval literature and docu- 
ments. They were written in a culture of imitation rather than originality. The 
Latin word auctoritas could mean both an author and their authority, usually 
both combined. Auctoritas had initially meant the quality by which a person 
can be trusted. Consequently, it came to mean the authoritative status of a per- 
son and further that of their writings.' So, the ‘author was not just any writer 
whose texts were read, but the modern equivalent to their status would be some- 
thing like that of Judith Butler in gender studies or Max Weber in sociology. 
Often, only writers meriting the status of auctoritas were explicitly cited, others 
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silently borrowed and, in modern terms, plagiarised. Thus, it is not uncommon 
to find late medieval theological treatises where long passages are copied from 
other high and late medieval works, but only patristic sources and the most 
important medieval theologians such as Bernard of Clairvaux are named. As 
the 'authorship in medieval discourse was more related to responsibility over 
content than style or form, we can find very original literary works written 
under the term ‘compilation. Some of these compilations circulated under the 
name of an authoritative figure, such as Augustine, and were generally consid- 
ered to convey his thoughts, even if the actual content contained very little of 
his original works. Furthermore, scribes and secretaries were often employed 
in the actual composition of the final work, creating a further layer of stylistic 
authorship in a text? Due to these characteristics, the scholarship of medieval 
literature has for a long time recognised that the role of a compiler and even 
copyist was often comparable and at times surpassed that of an auctor.* Finally, 
many texts circulated anonymously or under an early modern misattribution. 

In this chapter, we discuss one such complicated case of medieval author- 
ship, an anti-heretical treatise known as the Refutatio errorum, written in the 
1390s in German-speaking Europe. It has many of the characteristics described 
above. It is of compilatory nature, containing passages from different sources, 
very few of which are named in the text itself. It has no prologue or comparable 
section, where someone would claim their authorship over the text. Instead, 
the whole treatise is very practical, intended to provide information, but not 
to flaunt with rhetorical abilities of its composer. For a long time, the treatise 
was considered anonymous, until R. Válimáki provided contentual, structural 
and codicological evidence linking the treatise to the inquisitor Petrus Zwicker, 
who also authored a more famous anti-heretical work entitled Cum dormirent 
homines. As is usually the case with reattribution of a medieval work, the con- 
clusions are not based on single evidence, but on a combination of mutually 
enforcing pieces of evidence (described below). The purpose of this chapter 
is to add a new element to the analysis: computational authorship attribution 
using a Support Vector Machine (SVM). We discuss the results of the computer 
classification in relation to qualitative analysis of the text. The aim is to find 
out if computational methods provide added value to conventional author- 
ship attribution of a medieval text. Or, could one claim that the computational 
methods are to be regarded as superior to qualitative interpretation by an 
expert human reader? 

Computational authorship attribution can be considered a sub-category of 
style-based document authentication (Echtheitskritik),° and the first attempts 
to apply computational methods in the attribution tasks of Latin literature were 
in the 1970s.’ After that, there was a lull in computational study of classical 
and medieval literature with a few exceptions, but since the late 1990s and 
especially in the past five years several studies have demonstrated that com- 
putational authorship attribution can be a powerful tool in the recognition of 
classical and medieval authors? As perhaps symptomatic to the whole field 
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of digital humanities, the first publications of this new wave of computational 
authorship studies have concentrated on developing the methodology itself, 
and results have been published mainly in digital humanities journals. At the 
same time, the attribution of new texts to classical and medieval authors goes 
on with little regard to the results of computational stylistics,'" and some recent 
publications even claim that statistical stylometry has fallen out of favour." 
Although such a claim betrays lacking knowledge of one’s research field, digital 
humanities scholars are not entirely blameless. Very few publications have tried 
to bridge the gap between discussions on authorship in the fields of literature 
studies and history, and in the computational linguistics respectively. Remark- 
able exceptions are Jeroen de Gussems recent article on trails of Nicholas of 
Montiéramey’s secretarial style in Bernard of Clairvaux's writings, as well as 
Mike Kestemont and colleagues’ study on collaborative authorship of Hilde- 
gard of Bingen and Guibert of Gembloux.'? 

Consequently, computational analysis can raise suspicion among humanities 
scholars trained in qualitative methods. Machine learning or other branches of 
computational text classification may appear as radically new ways of analysing 
sources that bypass the human expertise (and are therefore terrifying). This 
they, however, are not. Although utilising computational capacity and han- 
dling amounts of data that far surpass the abilities of any human individual, 
the computational authorship attribution uses stylistic features that have been 
long since recognised as marks of authorship. A. Mutzenbecher prepared a 
new edition of Maximus of Turin’s sermons at the beginning of the 1960s and 
defined 16 criteria (some with several sub-categories), which he divided into 
four slightly overlapping groups: (1) external evidence, (2) biblical quotations 
and their exposition, (3) style and (4) sources. Some of his criteria were pri- 
mary, some secondary. An authentic sermon had to fulfil two primary criteria 
and several secondary criteria." 

For the purposes of this chapter, it is not necessary to explain what all of these 
were. It is sufficient to note that many of Mutzenbecher’s criteria were purely 
qualitative, such as the theological topics Maximus typically discussed, but 
especially criteria for the introduction and exposition of the biblical citations 
(numbers 6 to 8) and criterion 13, linguistic-stylistic characteristics, include 
features that are similar to stylistic features used in computational authorship 
attribution: word uni- and bigrams formed of function words and other very 
common expressions (for example, enim, ex quo, hoc est, quanto magis, sed dicit, 
ego dico, mirum est). Mutzenbecher was well aware that these stylistic features 
appeared in almost all other authors in addition to Maximus, and that none 
of them could individually constitute authorship, but if several of them sup- 
port each other reciprocally, their relationship might express something typi- 
cal. Computational authorship attribution does precisely that: it uses features 
that appear in almost all authors, but with different emphasis. To put it simply: 
it is the combination of all the significant stylistic features in comparison to 
their combination in other authors that determines authorship. A computer, 
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however, is not limited to a few obvious stylistic features of an author, but can 
handle thousands and millions of these in a systematic and repeatable way. 


The Refutatio Errorum and Its Redactions 


The test case in this study is a text known as the Refutatio errorum. It is a polem- 
ical description of the Waldensians, a religious group persecuted as heretics 
by the Catholic Church in the Middle Ages and early modern period. In the 
1390s, a series of inquisitions and other trials were directed against the group in 
German-speaking Europe,'* and the Refutatio was written as part of the literary 
polemics accompanying the persecution. The treatise gives a view of Walden- 
sianism very similar to that of the better known polemical treatise against the 
Waldensians, Cum dormirent homines (henceforth, CDH), written by one of 
the most important inquisitors of the late 14th century, the Celestine provincial 
Petrus Zwicker. The Refutatio is clearly a representative of the same era and 
state of knowledge about the Waldensians as the CDH. It has been commented 
on by scholars much less than the CDH, quite likely because the only available 
printed version, edited by Jacob Gretser together with the CDH (1613/1677), 
is obviously incomplete. It has 10 chapters, but the text stops abruptly in the 
middle of the tenth chapter." 

Among the scholars, there has been confusion rather than actual disagree- 
ment about the Refutatio's authorship. For a long time, everyone was reluctant 
to make definite claims about its authors. In his groundbreaking studies on 
the CDH, P. Biller did not suggest any author or dating for the Refutatio, but 
seems to have held the view that the two treatises were not written by the same 
author, that is Zwicker. In fact, Biller uses the common manuscript tradition 
of the Refutatio and CDH as an argument against the attribution of the CDH 
to Peter von Pillichsdorf, the author suggested by Gretser in his 17th-century 
edition. The argument runs as follows: Gretser’s misattribution was based on 
the now lost Tegernsee manuscript, which included the CDH and a short anti- 
Waldensian treatise by Pillichsdorf, who is the only author mentioned in the 
manuscript. This consequently led Gretser to propose Pillichsdorf as the author 
of both these treatises treating the same topic. According to Biller, this is a 
parallel case to that of the several manuscripts, including the CDH and the 
Refutatio. These too were two different treatises on the same subject, but were 
treated as one by both medieval scribes and modern compilers of manuscript 
catalogues. Biller did not state anything explicit concerning the authorship of 
the Refutatio, calling it and Zwicker’s CDH only ‘two tracts on similar mate- 
rial? They do indeed cover very much the same material, and because of this P. 
Segl has tentatively proposed that these two treatises originated from the same 
hand.” E. Cameron describes the treatise very vaguely, but evidently treats it 
as a product of the 1390s, at one point calling it ‘a third treatise from Zwicker's 


Manuscripts, Qualitative Analysis and Features on Vectors 283 


circle.” A. Patschovsky has also associated the Refutatio loosely with Zwicker, 
without making any definite claims about its authorship.” In other words, there 
has been a vague suspicion that Petrus Zwicker, or someone close to him, wrote 
the treatise. 

To further complicate the study of this text, the only available printed edi- 
tions are based on a text that is anything but representative of the manuscript 
tradition of the Refutatio. As noted, Jacob Gretser printed the tract in the 17th 
century from a manuscript that ends abruptly in the middle of Chapter 10. I. 
von Döllinger's 19th-century edition from the same manuscripts does not help, 
but adds further confusion, as the order of the chapters is mixed in the edi- 
tion, and material not belonging to the Refutatio is inserted among the text.” 
An analysis of all the preserved 19 manuscripts of the work by Valimaki has 
demonstrated that the edited version of the texts does not concur with the main 
manuscript tradition, that is the most common and widely circulated medieval 
text. All in all, Välimäki found four different redactions of the Refutatio erro- 
rum. Of these, Redaction 1 is by far the most common, with 13 manuscripts. 
It is also the only redaction accompanying Zwicker's better known and more 
popular treatise, the Cum dormirent homines. 'The two texts appear together in 
eight manuscripts. In comparison, the text printed by Gretser in the 17th cen- 
tury is a late and incomplete redaction (Valimaki’s Redaction 4) represented by 
only two medieval manuscripts.? 

In addition to collation of the Refutatio’s manuscript tradition, Välimäki has 
also proposed that the treatise can be attributed to Petrus Zwicker. The two 
works present a very similar view on the Waldensians; they both follow simi- 
lar structure of polemical refutation by presenting heretical propositions and 
Catholic counter-arguments, mainly based on biblical quotations. The most 
important pieces of evidence for the common authorship are the sources cited 
in these two works. In the CDH, Zwicker quotes almost exclusively the Bible in 
support of his arguments. The single exception to the rule is a reference to 
Boethius Consolation of Philosophy. The same quote can be found in almost 
the exact same form in the Refutatio errorum. In addition, the author of the 
Refutatio had direct access to Moneta of Cremonas 13th-century anti-hereti- 
cal treatise Adversus Catharos et Valdenses. 'The treatise was very rare north of 
the Alps, but Petrus Zwicker used it when composing the CDH. The final rare 
source implying the authorship of Zwicker is a misquotation of Ezekiel 33.12 
in the Refutatio. The exact form of this quotation comes from the legal consul- 
tations on the case against the goldsmith Heynus Lugner in the late 1330s or 
early 1340s, transmitted in two manuscripts, a Bohemian inquisitors manual 
Linz MS 177, and another, St. Florian, MS XI 234, which is copied from the first 
manuscript. The Linz manual was once owned by Petrus Zwicker and the St. 
Florian manual was copied from his own inquisitor's manual. Ergo, the author 
of the Refutatio had access to a rare text, which has certain manuscript circula- 
tion only in connection to Petrus Zwicker.” 
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Texts for the Analysis and Pre-processing 


Next, we analyse the two most important redactions of the Refutatio with com- 
putational classification in order to verify Zwicker’s authorship. The redactions 
selected for the classification are the most common and longest Redaction 1 
and Redaction 4 representing the version in Gretser’s edition. The text of 
Redaction 4 is taken from a manuscript Augsburg, Universtitatsbiliothek MS 
338 (TEST1) as well as Gretser's edition (TEST2). Redaction 1 is transcribed 
from the manuscript Vienna, Osterreichische Nationalbibliothek (ONB), MS 
1588 (TEST3). All texts are long enough for a reliable authorship attribution, 
from around 5,500 words in TEST2 to over 9,000 words in TEST3. We excluded 
Redactions 2 and 3, both extant in a single manuscript and not close to the 
original text. Neither of these redactions is representative of the medieval or 
modern reception of the work. 

We trained the classifier with Petrus Zwicker’s CDH (around 23,000 words). 
The text we used comes from the same Gretser’s edition as one of the tested 
versions of the Refutatio. The reference corpus for training our classifier 
consisted of late ancient and medieval anti-heretical polemical treatises, which 
is the genre of both Zwicker’s CDH and the Refutatio. In total, this training 
data has around 600,000 words. The emphasis is on medieval texts, and the 
corpus includes three works that are almost contemporary to Zwicker’s texts: 
Wasmud von Homburgs Tractatus contra hereticos, an anonymous Attendite a 
falsis prophetis and the already mentioned Peter von Pillichsdorf's Contra Pau- 
peres de Lugduno. In addition, the most important source and stylistic model 
for Zwickers CDH, Moneta of Cremona's Adversus Catharos et Valdenses, is 
included. From Moneta's very long treatise, we selected only Book 5, where 
many of the anti-Waldensian arguments are presented. Alone, Book 5 has 
over 120,000 words, and including all 400,000 words from the whole treatise 
would have created an imbalanced reference corpus. The complete corpus with 
bibliographical information is in Appendix 16.1. The data is available at our 
GitHub page in masked form only to protect copyrights of recent editions used 
in the corpus.” 

The dataset we use is far from easy and common in authorship attribution 
tasks. It is a mixed corpus of different edition and transcription standards, 
which is a problem for feature selection. Even though character n-grams are 
widely used as features in text classification, recent computational studies on 
the authorship of classical and medieval texts have preferred lemma-level 
approach and function word analysis over character n-grams or plain text. 
This is partly due to the orthographical variation in medieval Latin. The effects 
of orthographical variation are more marked when the features used are a few 
dozen function words. However, as our classifications are based on a much 
more complex set of features, the effect of single ‘bad’ features for the end result 
is minimal. Using word uni- and bi-grams from plain text, as well as charac- 
ter n-grams, also has significant benefits in Latin. It gives access to stylistic 
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solutions below the word level, such as the author’s decision to use the subjunc- 
tive instead of the indicative." 

We solved the most common issues of different editorial principles and 
orthographical variation with simple normalisation rules: 


uv 

joi 

yoi 

aee 

0e > € 

char — car (to solve variation charitas vs. caritas) 

wa — va (to solve variation ewangelium / evangelium and waldenses / 
valdenses) 


These solve the majority of orthographical variation caused by editorial and 
scribal conventions and the differences of medieval and classical Latin without 
masking potentially significant stylistic features. In addition to orthographical 
normalisation, in the pre-processing phase we cleaned the texts from editorial 
additions such as page numbers and chapter titles (unless part of the original). 
Punctuation, numerals and single characters were removed. From early medi- 
eval texts, we naturally cleaned the references to bible books and verses (which 
were added by later editors), but in late medieval texts, most notably Zwicker's 
own treatise, these are part of the original and were thus preserved. The pre- 
processing was done automatically, but confirmed with sanity checks. 

However, the transcripts from medieval manuscripts have much more varia- 
tion than edited texts. While orthographical habits and grammatical mistakes 
of individuals are excellent stylistic features when one is dealing with auto- 
graphs, in medieval manuscript culture such variation is noise in data. We are 
usually not interested in writing conventions of an individual scribe, but those 
of the author or compiler of the work. Even the usual orthographical variation of 
late medieval manuscripts is challenging to normalise without also masking 
potentially significant stylistic features. Thus, in addition to solving the ques- 
tion of Zwicker's authorship, we experimented with the data in order to find a 
relatively effortless way to pre-process and analyse such a corpus with a com- 
puter. The expected results from our dataset are as follows: 


1. If the pre-processing and feature selection are able to overcome the 
orthographical challenges, all test cases of the Refutatio should be clas- 
sified in a similar way. We expect that they are classified as Zwicker's 
works together with the CDH (values over 0). 

2. All other works should get values below 0 in the classification. 

3. If Peter von Pillichsdorf’s treatise from Gretser’s edition is classified 
together with Zwicker's works, the early modern editorial solutions have 
more weight more than medieval authorship. 
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Computational Authorship Attribution: Methods 


The puzzle we set out to solve is: Did Petrus Zwicker write the Refutatio erro- 
rum? In authorship attribution, this is called a verification problem: we do not 
have a closed set of candidates, but one suspected author.” We constructed the 
verification problem as a simple binary classification, where Zwicker' treatise 
forms one class and all other authors in the training material a second class. 
The classifier was trained with this material, and the versions of the Refutatio 
were presented as a test case. We use the two corpora combined as training data 
for the classifier, while the test cases form the test data. The different redactions 
of the Refutatio are each treated as a separate test case. 

Here, we present an overview of the methods. For technical details and code, 
please consult our project repository.” For the classification, we use a linear 
SVM, which is a simple yet effective classifier and has traditionally been applied 
in text classification tasks.*! The SVM works by learning a weight for every 
feature from the training data, so as to maximise the decision margin between 
the two classes. The weight being positive or negative is an indication of which 
class the feature is potentially associated with, although one needs to exercise 
caution when comparing features in isolation based on their weight. The fea- 
tures we use with the SVM are word unigrams and bigrams. In other words, we 
train the classifier with the training data to recognise the features typical and 
atypical of Petrus Zwickers style. After that, the test cases are classified, 
and the output is a value indicating how much (positive) or how little (negative) 
the sum of weighted features in each test case resembles the class (Zwicker). 
The values are represented on a scale between -1 and 1. 

The value and the decision are largely useless in isolation if we cannot be 
certain that the classifications are valid overall. Here, we apply the standard 
technique of cross-validation using the training data, which provides us with 
an estimate of the classification accuracy and therefore the reliability of our 
results on the actual test documents.? 

The classifier we use is by nature undiscriminating when it comes to the 
features. It does not care which features are used, as long as they increase 
the training accuracy. In authorship attribution tasks, this would ideally be fea- 
tures that describe the author's way of writing, such as the usage of function 
words. Even within a single genre as in our training data, however, the particu- 
lar topic of each text affects the results. We run the classification to unmasked 
data, and among the 10 strongest positive features five included 'Waldensians' 
in some form.? A classification from such features is based only partly on an 
author's style, and the topic of the texts heavily distorts the results. Therefore, 
we must mask topic words so as to not let the classifier focus purely on the topic 
of the texts instead of the author's style. To this end, we calculated the thou- 
sand most common words in post-classical (Christian) Latin. Any word not 
in the calculated word list will be masked. This has been shown to drastically 
increase the accuracy of cross-genre classifications, as it forces the classifier to 
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learn author-specific rather than topic-specific features.” This method does not 
completely remove topic words, but it only leaves the ones that appear regularly 
across different genres. In the following, we concentrate on results from the 
classification in the masked data. 


Comparing the Results 


The classification from the SVM using masked data is presented in Table 16.1 
and in Figure 16.1. 

The results were both expected and unexpected. First of all, the classification 
confirms that also from a stylistic perspective Petrus Zwicker is the author of 
the Refutatio errorum. All redactions, whether transcripts from manuscripts 
or the text from Gretser’s edition, were classified as Zwicker’s texts with a clear 
margin to other works. The exception here is the short treatise Attendite a falsis 


Table 16.1: SVM Classification, masked data. 


TEST2 Refutatio errorum R4b (Gretser) 1.0 
Anon. Attendite a falsis prophetis 0.953 
Petrus Zwicker Cum dormirent homines 0.926 
TEST1 Refutatio errorum R4a 0.745 
TEST3 Refutatio errorum R1 0.662 
Durand of Huesca Liber antiheresis -0.062 
Anon. of Passau Tractatus de erroribus hereticorum -0.212 
Berthold von Regensburg | Sermones -0.254 
Wasmud von Homburg | Tractatus contra hereticos -0.352 
Anon. Disputatio inter Catholicum et Paterinum | -0.357 
hereticum 
Durand of Huesca Liber contra Manicheos -0.573 
Petrus de Pillichsdorf Contra Pauperes de Ludguno -0.574 
Moneta Cremonensis Adversus Catharos et Valdenses, Book 5 -0.717 
Alanus de Insulis Contra haereticos -0.807 
Petrus Veronensis Summa contra haereticos -0.854 
Johannes Cassianus De incarnatione Domini contra Nestorium | -0.889 
Hermannus de Scildis Tractatus contra haereticos negantes -0.953 
immunitatem Ecclesiae 
Augustinus Hipponensis | Contra Faustum Manichaeum -0.986 
Augustinus Hipponensis | Contra epistulam Fundamenti -1.0 


Source: Authors. 
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Figure 16.1: Green dots represent the test cases (Refutatio), red dots Zwicker's 
CDH and blue dots texts by other authors. Source: Authors 


prophetis, discussed below. But if we exclude it, all other works from the ref- 
erence corpus got values below 0, and Zwicker’s texts were neatly classified 
between 0.662 and 1.0. Not surprisingly, the text from Gretser's edition got the 
highest value (1.0), in fact higher than the CDH. This appears contradictory at 
first, but the explanation is simple: the classifier first learns the weight of fea- 
tures from the whole text, but in cross-validation the text is divided into slices 
of 1,000 words, and the final value is the average of all the slices. Some of these 
got values below 1, weighting down the average. In other words, the Refutatio's 
style is indistinguishable from Zwicker' style in the Cum dormirent homines in 
comparison to the reference corpus. 

After pre-processing and masking, the features on which the SVM bases 
its decision pass the sanity check. In Appendix 16.2, there is a list of the 50 
strongest positive and negative features. In both positive and negative class, 
these are function words or common content words, or bi-grams combining 
such common words with masked words. Among these, only one positive fea- 
ture ('imo' 6.344) results from orthographical variation (imo vs. immo). All 
in all, a classification based on these features can be deemed reliable and non- 
dependable from topics. 

The classifier was also able to distinguish authorial signature from both 
editions and manuscripts so that the editorial solutions or orthographical vari- 
ation do not completely distort an author's style. This is confirmed not only 
by the consistent classification of the different versions of the Refutatio, but 
also by the value acquired by Peter von Pillichsdorf’s Contra Pauperes de Lud- 
guno. Despite being a tract on the same topic (Waldensians) as the CDH and 
the Refutatio, and from the same edition (Gretser) as the CDH and TEST2, it 
got a clearly negative value of -0.574. Six other texts got values nearer to the 
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threshold, so Pillichsdorf’s tract is very far from Zwicker’s texts. The edition, of 
course, has an effect, as we can see from the very strong value TEST2 got. 

The unexpected result was the Attendite a falsis prophetis. It got a very high 
value (0.953), and in the classification in the unmasked data, not presented 
here in detail, the result was consistent (0.559). This cannot be explained by the 
same topic, as the extremely high value is based on masked data. How should 
we interpret this? Do we have a new text attributed to Petrus Zwicker? This is a 
possibility, but the SVM's classification must be considered against the histori- 
cal context, manuscript tradition and the contents of the text. 

First, very little in the contents of the text contradicts Zwicker' views in the 
Refutatio or the CDH. In fact, the Attendite presents similar Waldensian propo- 
sitions and Catholic counter-arguments to those of Zwicker. For example, the 
CDH, Refutatio and Attendite all begin by refuting the Waldensian claim of a 
legitimate lay ministry and proceed then to treat individual points of doctrine 
such as denial of Purgatory and oath-taking. P. Biller has already pointed out a 
certain similarity between the Attendite and the CDH.* There is a minor detail: 
the Attendite states that the Waldensians do not accept the books of Macca- 
bees as parts of the biblical canon.” In the CDH, Zwicker stays silent about 
this and in fact uses the Maccabees to prove that the intercession on behalf 
of the dead had its foundation in the Bible.” This small divergence, however, 
can be explained by the development of Zwicker’s argumentation. He desper- 
ately needed the Maccabees in order to maintain the principle of finding the 
foundation of Catholic doctrine and practices solely in the Bible, a principle 
that was only fully developed in his main work, the CDH. The author of the 
Attendite did not follow these guidelines: some of the arguments are supported 
by patristic quotes. Yet, this does not automatically deny Zwicker’s authorship. 
Although Zwicker got rid of extra-biblical quotes almost completely in writ- 
ing the CDH, he refers to patristic auctoritates several times in the Refutatio.” 
Solely based on the contents, the Attendite could be an early work of Petrus 
Zwicker. He was, after all, a man obsessed about the Waldensians and the threat 
they posed to the Church, and it is not out of the question that he wrote a third 
treatise against them. 

The main doubt comes from the dating of the work. This is remarkably dif- 
ficult, because the text is very general and does not refer to any specific persons 
or incidents. Nor does the author use any particular or rare sources. In princi- 
ple, any late medieval author with access to anti-heretical treatises commonly 
circulating in Central Europe could have written the text. There have been two 
propositions about the author, one obviously mistaken, and another probably 
due to confusion with another text. Based on one manuscript (Wroclaw, Uni- 
versity library MS I F 230), R. Cegna misdated the text to the year 1399 and 
misattributed it to the Silesian inquisitor Johannes of Gliwice.“ There are no 
grounds whatsoever for either the dating or the attribution," and a few manu- 
scripts predate the one used by Cegna. Older research attributes the treatise 
to the Bohemian reform preacher and troublemaker Conrad of Waldhausen, 
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which would date the text to the 1360s.” The attribution might have resulted 
from confusion of this short tract with Conrad’s sermon on the same Bible 
verse, given at some point in 1363 to 1369.? The manuscript transmission his- 
tory points to Austria, Southern Germany, Bohemia and Silesia. P. Biller has 
proposed that the earliest possible dated manuscript of the Attendite is St. Paul 
im Lavanttal, MS 71/4, which has the year 1373 at folio 160va, referring to the 
composition date of a copy of a polemical letter from a converted Austrian 
Waldensian to Lombardian Waldensian Brethren.“ Although the part with the 
Attendite (folios 144ra-146vb) belongs to the same fascicule with the letter, it is 
uncertain if 1373 is the production date of this particular exemplar. The man- 
uscript MS 71/4 is a compilation with fascicules produced at different times 
in the late 14th and early 15th centuries.* The dating can only be confirmed 
through codicological analysis of the physical object itself, which is not possible 
within this study. The more secure dating comes from Klosterneuburg, MS CC 
826, datable to 1391 and described by P. Biller. With absence of a systematic 
study on the manuscript circulation of the Attendite, this is the most credible 
terminus ante quem. It means that the geographical distribution and dating 
of the manuscripts overlaps with the beginning of Petrus Zwicker's career as 
inquisitor of heresy, not excluding his authorship. 

The final caveat comes from the credibility of the attribution itself. The text 
is only around 2,500 words long, making the attribution unreliable, as we are 
dealing with data with noise. In addition, the Attendite and the CDH (which 
is the material we used to train the computer for the class Zwicker) quote the 
same Bible verses. Although the quotations are not word-to-word identical, 
there is shared material in these two works. In the attribution of such a short 
text, it necessarily has an impact. Finally, we used a version of the text from a 
single manuscript, which we had in machine-readable format. There is a critical 
edition of the text by R. Cegna, but it too is mainly based on a single manu- 
script with variant readings in endnotes." The final attribution of the Attendite 
is only possible when further study reveals the earliest redaction of the text and 
the manuscript dates are confirmed. From the earlier proposed authors, texts 
from Conrad of Waldhausen must be included in the classification as a pos- 
sible author. At this point, we must be content to say that the Attendite a falsis 
prophetis is possibly attributable to Petrus Zwicker, but the attribution needs 
corroborating evidence from the manuscript tradition. 


Conclusion: Additional Value of the Computational Analysis? 


In the future, the computational authorship attribution should be taken into 
the toolbox of historians and philologists, who work with anonymous, pseudo 
and dubious texts. The classifiers developed for the analysis of modern litera- 
ture or forensic purposes have been proved to be effective also in the study of 
ancient and medieval texts. 
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In our case study, the authorship of the Refutatio errorum, the computational 
methods produced both corroborating evidence and expected results, as well as 
radically new insights. The authorship of the Refutatio was confirmed as Petrus 
Zwicker through computational stylistics. Although there were previous, con- 
vincing pieces of evidence in support of this, the analysis is not without added 
value. A computer's decision is based on a completely different set of features 
than contents analysis and contextual evidence presented in the previous studies. 
Another important result was the classification of Peter von Pillichsdorf’s trea- 
tise as clearly non-Zwicker. This not only confirms the earlier qualitative attribu- 
tion, but demonstrates that our classifier can bypass the stylistic conventions of 
an early modern editor and detect the medieval author signature below. 

The greatest added value of computational authorship attributions comes, 
however, from the unexpected results, from texts behaving in an anomalous 
way. In this classification, the Attendite a falsis prophetis did precisely this. Up 
until this point, nobody has really considered Zwicker's authorship, because 
the manuscript tradition points to a somewhat earlier treatise. Yet, when the 
classification gave a strong attribution to Zwicker, it forced us to reconsider 
the qualitative evidence. This, in turn, was revealed to be indecisive as well. 
Although we are not ready to declare the case closed and a new text attributed 
to Zwicker, the example demonstrates the true power of computational meth- 
ods: it breaks the existing patterns of thought and demands re-evaluation of 
previous presuppositions. 

Our chapter demonstrates that computational history cannot progress in iso- 
lation from the more conventional study of history, particularly the very basic 
archival study of sources. The attribution of the Attendite a falsis prophetis is to 
remain ambiguous until the existing manuscripts are surveyed in detail. The 
study of history depends on source criticism, and in order to date, attribute and 
localise sources with digital methods we have to take care that our metadata is 
up to standard. 
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Appendix 16.1: Text Corpus 


TEST]: Refutatio errorum, Redaction 4a 
Source: Augsburg, Staats- und Stadtbibliothek, MS 338, fols. 159r- 170r. 


TEST2: Refutatio errorum, Redaction 4b 
Source: Gretser, J. (Ed.). (1677). Lucae Tvdensis episcopi, Scriptores aliqvot 
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svccedanei contra sectam waldensivm. Maxima Bibliotheca Veterum Patrum, 
Et Antiquorum Scriptorum Ecclesiasticorum. Tom. XXV. Lvgdvni: Anisso- 
nios, 302G-307F. 


TEST3: Refutatio errorum, Redaction 1 
Source: Vienna, Osterreichische Nationalbibliothek, MS 1588, fols. 191r-211v. 


TRAINING DATA 

Suspected author: Petrus Zwicker: 

Text: Cum dormirent homines (CDH) 

Source: Gretser, J. (Ed.). (1677). Lucae Tvdensis episcopi, Scriptores aliqvot svc- 
cedanei contra sectam waldensivm. Maxima Bibliotheca Veterum Patrum, Et 
Antiquorum Scriptorum Ecclesiasticorum. Tom. XXV. Lvgdvni: Anissonios, 
277F-299G. 


Other authors: 

Author: Alanus de Insulis (Alain of Lille) 

Text: Contra haereticos 

Source: Patrologia Latina 210. Text from Corpus Corporum: http://mlat.uzh 
.ch/?c=2&w=AlDelIn.ConHae 


Author: Anonymous 
Text: Attendite a falsis prophetis 
Source: St. Florian, MS XI 152, fols. 48v-50v. 


Author: Anonymous 

Text: Disputatio inter Catholicum et Paterinum hereticum 

Source: Hoécker, C. (Ed.). (2001). Disputatio inter catholicum et paterinum 
hereticum: die Auseinandersetzung der katholischen Kirche mit den italienis- 
chen Katharern im Spiegel einer kontroverstheologischen Streitschrift des 13. 

Jahrhunderts. Tavarnuzze (Firenze): SISMEL edizioni del Galluzzo, 3-80. 


Author: Anonymous of Passau 

Text: Tractatus de erroribus hereticorum 

Source: Nickson, M. A. E. (1962). A critical edition of the treatise on heresy 
ascribed to Pseudo-Reinerius, with an historical introduction. Queen Mary, 
University of London, 1-154. 


Author: Augustinus of Hippo 

Text: Contra Faustum Manichaeum 

Source: Patrologia Latina 42. Text from Corpus Corporum: http://mlat.uzh 
.ch/?c=2&w=AugHip.CoFaMa 


Author: Augustinus of Hippo 
Text: Contra epistulam Fundamenti 
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Source: Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL) 25.1. Text 
from Corpus corporum: 
http://mlat.uzh.ch/?c=19&w=August.CoEpFunCSEL 


Author: Berthold von Regensburg 

Text: Sermones [XXIIII, XXVIII, XXVUIL ‘Sancti pre Fidem and ‘Dominica 
Duodecima | 

Source: Czerwon, A. (2011). Predigt gegen Ketzer: Studien zu den lateinischen 
Sermones Bertholds von Regensburg. Tübingen: Mohr Siebeck, 203-233. 


Author: Durand of Huesca 

Text: Liber contra manicheos 

Source: Thouzellier, C. (1964). Une somme anti-cathare: le Liber contra 
Manicheos de Durand de Huesca. Louvain: Spicilegium sacrum Lovaniense, 
67-336. 


Author: Durand of Huesca 

Text: Liber Antiheresis 

Source: Selge, K.-V. (Ed.) (1967). Die ersten Waldenser: mit Edition des Liber 
antiheresis des Durandus von Osca (Vol. 2). Berlin: De Gruyter, 3-257. 


Author: Hermannus of Scildis 

Text: Tractatus contra haereticos 

Source: Zumkeller, A. (1970). Hermanni de Scildis O.S.A.: tractatus contra 
haereticos negantes immunitatem et iurisdictionem sanctae Ecclesiae et 
Tractatus de conceptione gloriosae virginis Mariae. Würzburg: Augustinus- Verl., 
3-108. 


Author: Johannes Cassianus 

Text: De incarnatione Domini contra Nestorium 

Source: Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL) 17. Text 
from Corpus Corporum: http://mlat.uzh.ch/?c=19&w=Cassia.ConNesCSEL 


Author: Moneta Cremonensis (Moneta of Cremona) 

Text: Adversus Catharos et Valdenses, Liber V 

Source: Moneta (Cremonensis). (1743). Monetae Adversus Catharos et Val- 
denses: libri quinque. T. A. Ricchini (Ed.). Romae: Ex Typographia Palladis, 
389-560. 


Author: Petrus de Pillichsdorf (Peter von Pillichsdorf) 

Text: Contra Pauperes de Lugduno 

Source: Gretser, J. (Ed.) (1677). Lucae Tvdensis episcopi, Scriptores aliqvot svc- 
cedanei contra sectam waldensivm. Maxima Bibliotheca Veterum Patrum, Et 
Antiquorum Scriptorum Ecclesiasticorum. Tom. XXV. Lvgdvni: Anissonios, 
299E-302F. 


294 Digital Histories 


Author: Petrus Veronensis (?) 

Text: Summa contra haereticos 

Source: Kaeppeli, T. (1947). Une somme contre les hérétiques de S. Pierre 
Martyr (?). Archivum Fratrum Praedicatorum, 17, 295-335 


Author: Wasmud von Homburg 

Text: Tractatus contra hereticos Beckardos, Lulhardos et Swestriones 

Source: Schmidt, A. (Ed.) (1962). Tractatus contra hereticos Beckardos, Lul- 
hardos et Swestriones des Wasmud von Homburg. Archiv für mittelrheinische 
Kirchengeschichte, 14, 336-386. 


Appendix 16.2: The 50 Strongest Positive and Negative 
Features from the SVM Classification on Masked Data 


Positive features: 

tv XXXXXXXXX 6.895 
imo 6.344 
xxx item 5.824 
sed dicis 5.234 
xxx dixit 5.04 

item xxxx 4.864 
item xxxxx 4.389 
item xxx 4.385 
XXXXXXXX imo 3.262 
semper XXXXX 3.081 
nostri XXXX 2.793 
non solvm 2.562 
ecce XXXXXXXXX 2.517 
XXXXXX ecce 2.459 
svvm XXXXX 2.225 
sanctorvm dei 1.919 
dixit Xxxxxx 1.833 
dicis xxxxxxxx 1.729 
dicentes xxxxxx 1.691 
dicis 1.648 
XXXXXX item 1.608 
XXXXXXX item 1.493 
ecce 1.435 
XXXXX mea 1.397 
adhvc in 1.346 
item Xxxxxx 1.303 


vbi XXXXXXXXX 1.295 
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item XXXXXXXXX 1.287 
nec qvidem 1.285 
solvm 1.235 
qvod angeli 1.143 
sibi ipsis 1.129 
domine xxxxx 0.953 
mevm item 0.893 
domini nostri 0.819 
habes 0.812 
discipvli 0.752 
xxxxx domine 0.746 
dominvs XXXXXXXX 0.744 
nvnqvam 0.742 
qvi venit 0.729 
noster XXXXx 0.692 
tva 0.672 
vt videlicet 0.658 
privs per 0.65 
velvt 0.634 
xxxxx nolite 0.614 
habere xxxxxxxx 0.614 
XXXXX ecce 0.606 
XXXXX pro 0.595 
Negative features: 
ait -11.246 
apostolvs -6.224 
tantvm -5.863 
nec -5.201 
vt -3.964 
ei -2.957 
idest -2.747 
XXXXXXXX vt -2.545 
hvivs -2.221 
enim -2.133 
XXXXXXXXX Vt -1.95 
qvod -1.93 
XXXXXXXXXXXXX XXXXXXXX -1.705 
deo -1.642 
ac -1.619 
qvomodo -1.477 


de -1.415 
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dicitvr -1.334 
dicit -1.321 
nvllvs -1.277 
hoc -1.275 
est -1.251 
libro -1.198 
dicvnt -1.187 
qva -1.177 
cavsa -1.173 
Xxxxxxx qvod -1.166 
si avtem -1.125 
XX -1.105 
secvndvm -1.066 
se -1.052 
sic -0.999 
ista -0.929 
non est -0.905 
dictvm 0.894 
facit -0.825 
XXXXXX qvia -0.813 
ab -0.787 
si -0.739 
XXXXXXXXXX -0.718 
XXXXXXXX -0.714 
potivs -0.69 

carnem —0.684 
iterum -0.677 
XXXXXXXXX —0.661 
et -0.651 
qvod xxxxxxx -0.63 

legitvr -0.629 
aliqva -0.629 
XXXXXX —0.623 

Notes 


! See, e.g., Levy 2012: 23-24. 

? Minnis 1988: 192-193, 196 and passim. 

? Kestemont, Moens & Deploige 2015; De Gussem 2017. 

* See, e.g., Johnson 1991; Williams-Krapp 2000; Minnis 2006; Conti 2012. 

> Välimäki 2016: 45-76; Välimäki 2019: 38-39, 48-49, 56-58, 61-64, 102. 

$ Stover & Kestemont 2016a: 144. 

7 Marriott 1979; one should also remember the pioneering work on statistical 
stylistics that precedes the computer era, Yule 1944. 
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$ See, e.g., Clark 1987. 

? Gurney & Gurney 1998; Tse, Tweedie & Frischer 1998; Tweedie, Holmes 
& Corns 1998; Forsyth, Holmes & Tse 1999; Kestemont 2012; Kestemont, 
Moens & Deploige 2015; Kestemont et al. 2016; Stover & Kestemont 2016a; 
Stover & Kestemont 2016b; De Gussem 2017. 

10 See, e.g., Weidmann 2015. 
1 Adams 2016: 202. 
? Kestemont, Moens & Deploige 2015; De Gussem 2017. 
? Mutzenbecher 1961: 202-219. 
14 Ibid.: 203-204, 207-209. 
5 Ibid.: 202. 
'© Kolpacoff 2000: 247-261; Modestin 2007: 1-12; Välimäki 2019: 30-37. 
Gretser 1677. 
Biller 2001: 252-253. 
Segl 2006: 185 n. 102. 
Cameron 2000: 140, 142-143. 
Patschovsky 1979: 27 n. 42. 
Döllinger 1890: 331-344. 
Valimaki 2019: 39-48. 
Välimäki 2016: 57-76; Välimäki 2019: 38-39, 48-49, 56-58, 61-64, 102. 
See https://github.com/propreau/zwicker. 
Kestemont, Moens & Deploige 2015; De Gussem 2017. 
? Cf. Kestemont, Moens & Deploige 2015: 9-10. 
? Such as variations -ci-/-ti- ; -mq-/-nq- ; -dq-/-cq- ; -mp- / -mn-, which 
cannot be normalised with simple replace rules without losing significant 
features in the text. 
Koppel, Schler & Argamon 2009; Koppel et al. 2012. 
See https://github.com/avjves/AuthAttHelper. 
Cortes & Vapnik 1995; Chang & Lin 2011. In particular, we used the scikit- 
learn implementation of SVM with L2 penalty and squared hinge as loss. 
The C-parameter of the classifier was set using cross-validation so as to 
avoid overfitting on the test data. 
In cross-validation, we only focus on our training data, ignoring the actual 
test texts. We remove one document at a time from the training data and 
consider it as a new test case. Our current training data now consists of all 
texts but the new test case, and using it we subsequently train the classifier 
and let it give a class and a value for the new test case. Since we know the 
actual authors of the texts included in the training data, these results show 
how accurately the classifier classifies data which it has not seen. 

‘heretici valdenses; ‘valdensis; 'valdensis heretice; ‘valdensivm ‘tv 

valdensis. 

We calculated the word list from a corpus of 15 million words compiled for 

an attribution task of Augustine of Hippos works. It is available at https:// 

github.com/propreau/zwicker. 

3 Stamatatos 2017. 
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36 Biller 1974: 365. 

? St. Florian, MS XI 152, fol. 49v. 

38 Zwicker 1677: 288D-288E. 

?? Välimäki 2016: 77-114; Välimäki 2019: 61-62, 68-71, 73-85, 90, 94-98, 
102-103. 

^ Cegna 1982. 

^' Patschovsky 1994: n. 15. 

? Bartoš 1932: 32-33; Molnar 1989: 158 n. 29. 

^ Cf. Patschovsky 1979: 125-126. 

“4 Biller 1974: 221. Biller cites MS 92/4, fol. 161va, but it is a mistake, the man- 
uscript in question is MS 71/4. For the best overview of the manuscript 
tradition, see Biller 1974: 365-366. 

^ Glaßner 2002: Cod. 71/4 (olim 28.4.23). 

4 Biller 1974: 216-217, 365. 

^ Cegna 1982: 53-65. 
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CHAPTER 17 


Macroscoping the Sun of Socialism 


Distant Readings of Temporality in 
Finnish Labour Newspapers, 1895-1917 


Risto Turunen 


Introduction 


But in spite of all indifference, the sun of socialism has cast its first rays 
there. Even there, great and clear thoughts on the injustices of the pre- 
sent system are silently smouldering.' 


The optimistic quote above was written in 1903, by a labour journalist outlin- 
ing the preconditions of socialism in the eastern periphery of the Grand Duchy 
of Finland. Characteristic of the socialist discourse of the time, he used the 
phrase ‘the sun of socialism. It was one of the most important symbols of 
the Finnish labour movement in the early 20th century, figuring not only in 
newspapers, but also in poetry and red banners. Without doubt, there was 
something in the red sun of socialism that captured the contemporary prole- 
tarian imagination. 

Many studies in social and cultural history have proven that symbols acting 
as ‘simplified objectifications of ideologies’ play a crucial role in the making 
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of political movements.’ The sun is the starting point for this chapter, for we 
believe that this symbol carries rich temporal information from a century ago. 
Thus, it can be used as a symbolic key to unlock socialist perceptions of the 
imagined past, present and future. The breakthrough of Finnish socialism has 
been analysed from a variety of perspectives,’ but the focus has not been on 
‘temporality, that is, the way human beings experience time. There are some 
occasional comments on the socialist temporality in the previous research, 
mainly concentrating on the Marxian interpretation of history or on the future 
expectations in the form of socialist utopianism and eschatology.’ The third 
dimension of time, the present, has largely escaped scholarly attention. For 
example, the sun of socialism has been seen in the context of the future, as a 
symbol for a better tomorrow and freedom.? The future-oriented meaning cer- 
tainly existed, but we can add more interpretative depth to the investigation of 
the sun by also including the present in our analysis. 

According to Reinhart Kosellecks thesis on temporality, the emergence 
of modernity, especially the unexpected rupture of the French Revolution of 
1789, diminished the value of experience in forecasting the future While 
Koselleck’s argument concerned the German-speaking world, we argue that 
the General Strike of 1905 had a similar effect in the Finnish context, expand- 
ing the gap between the experiences (of the past) and the expectations (towards 
thefuture) and, simultaneously, creating a new understanding ofthe present. The 
General Strike from 30 October to 6 November in 1905 was not only a 
direct result but rather an active extension of the 1905 Russian Revolution 
to the Grand Duchy Finland.’ For the first time in Finnish history, workers 
momentarily seized a great part of political power, and this brief moment, a 
mere one week of imagined proletarian rule, meant that neither the old rules of 
politics nor old temporalities applied to the new situation. The General Strike 
led to a set of parliamentary reforms and to universal suffrage in 1906, and 
finally in 1907, just four years after the quote at the beginning of this chap- 
ter, Finland had the largest socialist party with parliamentary representation 
in Europe? 

This chapter has a threefold goal. First, regarding historical content, it con- 
stitutes a case study that tries to decipher the intriguing symbol of the rising 
sun and, thus, to broaden our understanding of the socialist temporality in 
Finland. The focus lies on the relation between the sun and the present, or 
more precisely, on how the sun illuminates the proletarian perception of their 
reality at the turn of the century. Second, methodologically speaking, we intro- 
duce ‘macroscopic’ approaches that allow historians to see something in the 
sources that is unavailable to the naked eye.’ In practice, this means quantifying 
comparable word frequencies, collocates and key collocates. Third, we describe 
what it means to write digital history, by sketching a simple theoretical model, 
which sheds a new light on the intellectual journey the scholar undertakes on 
her way from original sources to historical wisdom. 
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Relative Word Frequencies: Counting the Heartbeats 
of Finnish Politics? 


We begin our journey to the core of the socialist sun with an already 
well-established practice in digital history, that is, counting relative word 
frequencies over time. First, we download the dataset from the National Library 
of Finland: the raw text files of the biggest socialist (Työmies, “The Working 
Man’), conservative-nationalist (Uusi Suometar, ‘New Finland’) and liberal- 
nationalist (Helsingin Sanomat, ‘Helsinki News, and before 1904 Päivälehti, ‘The 
Daily Paper) newspapers from 1900 to 1917.'° Then, we find the words refer- 
ring to the present in each year by using the search string 'nyky*; which cov- 
ers the most common Finnish words denoting the present moment: ‘nykyinen’ 
(‘present’ as an adjective), nykyisyys" (present! as a noun) and ‘nykyisin / 
"nykyään (adverbs for the present moment).'' 

Figure 17.1 shows a trend. The socialist newspaper Työmies has the high- 
est frequency of ‘the present’ in 1900, but by the year 1904 the references to 
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Figure 17.1: The relative amount of the present (myky*) in three newspapers, 
1900-1917. Source: Raw text files of Työmies, Uusi Suometar, and Helsingin 
Sanomat / Päivälehti, distributed by the National Library of Finland, https:// 
digi.kansalliskirjasto.fi/opendata. 
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the present have dropped far below the two bourgeois newspapers. What 
does this trend, this piece of information, mean? Based on previous research, 
censorship might lead us to the correct explanation. According to Antti Kujala, 
the censorship of the labour press in Finland became considerably tighter in 
1903 when the Finnish Social Democratic Party officially adopted Marxism. 
Bans and warnings succeeded in silencing the most radical socialist discourse, 
for during the next year political commentary disappeared 'almost completely' 
from Tyómies.? However, the General Strike of 1905 broke the silence on politi- 
cal matters as preventive censorship ended temporarily. 

The renowned Finnish socialist poet Kóssi Kaatra summarised the dramatic 
new temporality summoned by the strike in his poem Suurina päivänä (‘Dur- 
ing the Great Days’) from 1906: 


It is great to be alive, 
when in a single day, in a night 
now we create more new things than in the work of many centuries." 


Reading Kaatra’s words and focusing especially on the temporal marker ‘now, 
it is not a surprise that we see a sharp rise in the socialist present, especially 
during 1907, which happens to be the year when Finland held the first par- 
liamentary elections. Could the new political situation (electoral speculation, 
campaigns and aftermath) explain the peak of 1907? Based on both close and 
distant reading of Työmies, this seems to be the case. The words such as ‘strike; 
‘government, ‘nation, land; ‘Duma and ‘senate’ increase greatly in close prox- 
imity to the present after the General Strike. Thus, the rise of the present 
means, in fact, the rise of the political present. 

We could explain this finding in the light of Benedict Andersons theory of 
‘imagined communities which argues that between 1500-1800 technological 
innovations and the advance of print-capitalism profoundly changed our expe- 
rience of time and space.” In the case of Finnish working people, these changes 
probably took place much later, beginning approximately from the mid-19th 
century onwards.'* When looking at the date on the front page of the daily 
socialist newspaper, a Finnish worker could see with her own eyes that time 
was moving linearly forwards day after day. In addition, she could imagine that 
meanwhile there were thousands of other workers like just like her reading the 
very same edition, although she had never actually met them." Using Ander- 
sons theory to explain the dissemination of socialism instead of nationalism, 
as it has usually been applied, guides our analysis towards the close connection 
between temporalities and print media, or in our case, between the socialist 
interpretation of the present and the Finnish labour press. Because tempo- 
ralities are always constructed, they can be manipulated. The leading socialist 
newspaper reacted to the changing political conditions after the General Strike 
by accelerating the flow of time, by repeating an imperative temporal message: 
the time to act is now. 


Macroscoping the Sun of Socialism 307 


One should not forget that there might be other alternative explanations 
for the peak of 1907. For example, there could have been more adverts in the 
socialist newspaper in 1907 than before, as the adverts of the early 20th cen- 
tury often referred to the present in order to sell their products better. The lack 
of information on what constitutes the peaks and valleys of word frequencies 
is not a trivial problem, but rather characteristic of word frequency charts in 
digital humanities. Far too often, they neglect variation inside a given corpus. 
Figure 17.1 is slightly better than the usual combination of a relative word fre- 
quency (y-axis) and time (x-axis) in the sense that it contains the extra dimen- 
sion of political affiliation. However, the figure would be even better if it showed 
the frequency of ‘the present’ in different newspaper genres (editorials, foreign 
section, adverts, poems, letters to the editor, etc.) for each newspaper under 
investigation. The distribution of genres would show us in which journalistic 
context the present is discussed in each major political language of the time. 

Despite the weaknesses, simple word frequency charts can reveal useful, low- 
level information to historians. In this case, it revealed above all that the amount 
of the socialist present varies strongly over time. The valley of 1904 is probably 
due to censorship, whereas the peak of 1907 is explained by the heated political 
situation. However, the general trend is that all the newspapers increase their 
references to the present with the passage of time. Does the trend reflect the 
increasing heartbeat of Finnish politics, or the rise of present-intensive adver- 
tising in all newspapers, or perhaps something completely different? We do 
not want to get entangled in that question in the context of this chapter, but we 
do want to highlight the importance of keeping an open mind when attaching 
meanings to the figures. As doctors know, if the heart beats faster than normal, 
the possible causes are many and varied. 


Collocation: Mining the Semantic Structure of 
the Socialist Present 


Historians inspired by conceptual history, discourse studies or the Cambridge 
school of intellectual history have for long been interested in the linguistic 
contexts in which their historical objects of interest (concepts, discourses, 
intellects) figure.? Nowadays, it is possible to quantify such linguistic contexts, 
given that the textual sources are in a machine-readable form. One approach 
to operationalise ‘the linguistic context’ is to define the context as all the words 
appearing in a window of x words to the left or right of the studied word. Since 
we are dealing with a highly inflected language, Finnish, it is important to lem- 
matise all the words in the text files, that is, to replace all word variations with 
their base form, before the actual analysis in order to get more reliable results." 
In our case study, we could quantify all the words that exist in proximity of 
five words from the words referring to the present in the three biggest social- 
ist newspapers. Why five words? There is no magic formula for defining the 
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Table 17.1: The most frequent words in a window of five words from the 'pre- 
sent’ (search strings 'nykyi* and 'nykyä”) in three socialist newspapers, 
1895-1917. 


RANK WORD TRANSLATION FREQUENCY 
1 olla to be 60,941 
2 ja and 25,893 
3 se it, that 19,693 
4 ettà that 14,741 
5 joka which 14,673 
6 ei no 12,174 
7 tama this 8,618 
8 ne they 6,348 
9 kun when 4,779 

10 saada to get 4,721 


Source: Lemmatised raw text files of Työmies, Kansan Lehti and Länsisuomen 
Tyómies / Sosialisti, 1895-1917, distributed by the National Library of 
Finland, https://digi.kansalliskirjasto.fi/opendata. 


perfect window size. Historians must decide, usually through trial and error, 
which is the most appropriate selection for their own research guestions. 

As we can see in Table 17.1, the problem with this approach is that the most 
freguent words connected to the present are common words which do not 
reveal anything relevant from the historian's perspective. Fortunately, corpus 
linguists have developed a statistically more sophisticated method in colloca- 
tion analysis that produces more meaningful raw information for historians to 
contemplate. Collocates are words that appear more freguently than statistically 
expected in close proximity to the search word. 

When looking at Table 17.2, one can immediately see that it contains use- 
ful information inviting a further human analysis. After close reading of the 
concordances, the list of examples of ‘the present’ as they occur in the socialist 
newspaper texts, we found three categories of technical errors: some collocates 
had, not surprisingly, OCR errors; some were lemmatised into a wrong base 
form; and some suffered from an incorrect word segmentation. We also 
increased the minimal freguency of collocation to a relatively high cut-off point 
of 200 instances, in order to filter out advertisements that plague all guantitative 
analyses of the Finnish newspaper corpus.? Then, after cleaning up Table 17.2 
for errors and function words, we created a simple visualisation that is hope- 
fully easier to understand for most historians.” Apart from absolute frequen- 
cies and Finnish originals, Figure 17.2 contains the same information as that in 
Table 17.2, but in a more accessible and user-friendly form. Figure 17.2 shows 
what could be poetically defined as ‘the architecture of the concept.” It is based 
on the principle that the human brain can intuitively understand: the closer the 
word is to the centre, the more strongly it is connected to the socialist present. 
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Table 17.2: The collocates of the present ('nykyä; ‘nykyi*’) in three socialist 
newspapers, 1895-1917, min. frequency 200. 


RANK WORD | TRANSLATION FREOUENCY | MI-VALUE 
i yhteiskuntajär- | "m 354 10.29 
jestelma 
2 tilanne situation 1,460 9.91 
jarjestelma system 1,506 8.58 
4 mallita to prevail (OCR 
error) 

5 vallita | to prevail 1,464 8.14 
6 oleskella | to stay 404 8.11 
7 yhteiskunta | society 2,026 8.11 
8 tätä | this 282 7.56 
9 olosuhteet | conditions 488 7.35 
10 kapitalistinen | capitalist 394 7.32 
11 kallis | expensive 587 6.95 
12 pula | shortage 293 6.87 
13 kurja | miserable 322 6.86 
14 politiikka — | politics 299 6.84 
15 valtiollinen | state- 500 6.79 
16 työttömyys | unemployment 452 6.76 
17 muoto | form 801 6.68 
18 asema | position, condition 3,547 6.64 
19 olo | condition 1,731 6.63 
20 kanta | view 897 6.56 
21 kunnallinen | municipal 487 6.55 
22 sota | war 1,415 6.55 
23 taloudellinen | economic 658 6.53 
24 vaikea | hard, difficult 295 6.52 

25 käytäntö | practice 

26 kehitys progress, 

development 
27 kurjuus misery 280 6.42 
28 säilyttää to preserve, to save 323 6.39 
29 epäkohta | grievance, 307 6.36 
shortcoming 

30 mahdoton | impossible 485 6.35 


Source: Lemmatised raw text files of Työmies, Kansan Lehti and Länsisuomen 
Työmies / Sosialisti, 1895-1917, distributed by the National Library of Finland, 
https://digi.kansalliskirjasto.fi/opendata. 
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The strength of collocation (MI-value: min. 6, max. 11) 
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Figure 17.2: The collocates of the present ('nykyi*; ‘nykya*’) in three social- 
ist newspapers, 1895-1917. Source: Raw text files of Työmies, Kansan Lehti 
and Länsisuomen Työmies / Sosialisti, 1895-1917, distributed by the National 
Library of Finland, https://digi.kansalliskirjasto.fi/opendata. 


It is relatively easy to find patterns in the figure for a historian with prior 
knowledge on the topic. First, the socialist present seems to attract phenomena 
that are considered to be negative, if not universally, at least as perceived by 
most people. In addition to the abstract concept of misery, the readers of 
the socialist newspapers were frequently introduced with the more concrete 
evils of ‘shortage, ‘unemployment’ and ‘war. Negativity was enforced with the 
adjectives ‘miserable’ and ‘hard, difficult, especially when talking about current 
‘conditions. A close reading also reveals that current ‘politics, referring to both 
tsarist repression and domestic bourgeois oppression, belong to the same nega- 
tive semantic field.” The critique of ‘politics’ is understandable since the social- 
ists, despite polling the most votes in every election, were not represented in the 
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government until in the year 1917.” Of course, in the socialist understanding, 
their interpretation of the present was not negative in an exaggerated sense, but 
rather a realistic portrayal of the inherent problems in capitalism. 

This leads us to the second feature worth highlighting in Figure 17.2: the 
misery of the socialist present was not accidental, but systematic. The words 
‘societal system; ‘system’ and ‘society’ in the figure hint at this socialist pattern 
of conceptualising the present. The everyday had a structure, and those great 
‘capitalist; ‘state, ‘economic’ and ‘municipal’ forces shaping the present were not 
mystical nor divine, but explainable through a rational analysis. Let us take a 
quote from each of the three main socialist newspapers to illustrate the logic: 


The eyes of many workers have opened even here [in Eura] to see the 
misery into which the present system has brought our society. 


The curse of the present system is precisely that the more satisfied the 
capitalists can be with their lives, the more miserable the life of 
the workers has become under the constraining conditions." 


Everyday experience shows us that it is not possible to achieve sufficient 
improvements for the condition of the majority of the people on the 
grounds of the present bourgeois system ...?* 


Thus, socialists not only disapproved of the present with negative words, but 
they also tried to explain the causal mechanism behind it, by arguing that the 
present system was the root of all evil. 

Although the present was in essence systematically bad, there was hope, or as 
the grand old man of Finnish labour history Hannu Soikkanen has argued in 
his seminal study Arrival of Socialism in Finland (1961), 'the present and future 
conditions were contrasted as starkly as possible.” According to Soikkanen, 
this contrast was one of the main features of socialism that made it psycho- 
logically so attractive for the working people. The third pattern in Figure 17.2, 
visible to an experienced eye, refers to this connection between the present and 
the future, that is, to the words that imply changes and movement. “The present 
situation indicates a way of thinking that is not limited by eternal conditions 
created by God in the beginning of time. In addition, the phrase ‘the prevailing 
conditions’ is used regularly in the political language of Finnish socialism, in 
order to undermine the foundations of the present status quo: 


Modern socialism is thus a product of the prevailing economic and 
social conditions in the present.” 


Class struggle is rooted in the unsolvable conflicts between employers 
and employees. A collective agreement cannot remove the conflict, and 
neither can it abolish the hegemony of the capital over work in the pre- 
vailing conditions of the present.?! 
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Capital is accumulating in the hands of fewer and fewer people while 
the propertied class is not growing. Marx taught that the natural result 
is that the conditions themselves must change, there will be a fall of the 
prevailing system.” 


Finally, the concept of ‘progress’ is fundamental to understanding the social- 
ist temporality, for it tied together the past, present and future. If the present 
capitalist society was indeed a temporary product of a historical process, then 
society would surely change in the future too, or like one socialist journalist 
foretold: ‘By the force of historical progress, the present system of oppression 
shall be once wiped from the stage, and its wretched henchmen shall get the 
reward they deserve? 

We have seen that the collocation method, combined with different visuali- 
sation techniques (tables, figures), can produce a massive amount of low-level 
information on the semantic content of a concept in historical texts, in this case 
on the concept of the present in three leading socialist newspapers. It would 
have taken several years, perhaps a decade, to manually close read the more 
than 180 million words printed in these newspapers. However, a computational 
distant reading helped us to discover that the socialist present was (1) negative, 
(2) systematic and (3) changeable. 

In the end, it always depends on the skills of a historian whether or not ele- 
mentary quantitative information is successfully transformed into historical 
knowledge. Here, instead of limiting our critical thinking only to the mean- 
ings of preliminary ‘results, we should also inquire into the presuppositions 
embedded in each quantitative method. For example, from a historian’s point 
of view, the collocation method is lacking comparative contexts for it operates 
only within the political language of socialism. How do we know if these found 
features of the socialist present are unique, or if they belong to a more general 
discourse of the time? 


Key Collocation: Placing the Socialist Present into the 
Contemporary Context 


The collocation method shows the strength of mutual relation between two 
words. Another useful method historians interested in language could bor- 
row from corpus linguistics is the keyness method, which can be used to show 
differences between two discourses. Keyness detects the words which appear 
more frequently than expected by pure chance in the text collection A (‘tar- 
get corpus’) compared to the text collection B (‘reference corpus’).** Next, we 
combine the main ideas behind the two methods under the concept of key 
collocation, which aims to reveal semantic differences in the use of a certain 
historical concept. 

First, as previously, we collect all the words appearing in a window of five 
words of the present (using search strings nykyä* and 'nykyi*) in the socialist 
newspapers, and combine these words into one unified corpus. Then, we 
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do exactly the same for the bourgeois newspapers, in this case for the big- 
gest liberal-nationalist newspaper Helsingin Sanomat / Päivälehti and the 
biggest conservative-nationalist newspaper Uusi Suometar. Now we have two 
corpora, and we can utilise the keyness method in order to see which words 
appear more freguently in the socialist discourse on the present compared with 
the bourgeois discourse on the present. 

Table 17.3 places our findings on the socialist present in the previous sec- 
tion into the wider context of the early 20th-century newspaper discourse. 
The only clearly negative word in the top 20 key collocates is ‘unemployment. 


Table 17.3: The key collocates of the socialist present ('nykyä*; ‘nykyi’), com- 


pared with bourgeois newspapers, 1895-1917. 


RANK WORD TRANSLATION | KEYNESS E T 
1 työläinen worker 1,917 19.8 
2 yhteiskunta society 1,586 5.8 
3 oma own (OCR error) 900 2.1 
4 kapitalistinen capitalist 598 38.5 
5 köyhälistö proletariat, the poor 587 28.7 
6 yhteiskuntajärjestelmä societal system 546 53.8 
7 liitto union 539 3.1 
8 torppari crofter 524 5.6 
9 työmäki working people 505 13.6 
(OCR error) 
10 työ work 504 2.0 
11 työväki working people 448 4.2 
12 tilanne situation 432 2.3 
13 jarjestelma system 422 2.3 
14 tyóttómyys unemployment 420 6.8 
15 työnantaja employer 414 5.8 
16 työmies working man 390 32 
17 palkka wage 368 2.3 
18 moida can (OCR error) 366 2.8 
19 järjestö organisation 357 5.3 
20 ammatillinen occupational, 321 33.7 
vocational 


Source: Lemmatised raw text files of Työmies, Kansan Lehti, Länsisuomen 
Työmies / Sosialisti, Helsingin Sanomat / Päivälehti and Uusi Suometar, 
1895-1917, distributed by the National Library of Finland, https://digi 
-kansalliskirjasto.fi/opendata. 
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However, scrolling further down the list shows that socialists indeed use words 
such as ‘misery’ (ranked 24 by keyness value), ‘miserable’ (27), oppression (87) 
and ‘hunger’ (111) in close proximity to the present much more often than 
their contemporaries, leading us to the conclusion that the level of negativity 
in the socialist discourse on the present was extraordinary. Socialists imagined 
the worst possible present. 

What about the systematic nature of the present we encountered when quan- 
tifying collocates? It exists also in Table 17.3 in the form of ‘society’, ‘capitalist, 
‘societal system’ and ‘system. In addition, we have strong supporting evidence 
of ‘social order’ (76), ‘economic system (80), production system’ (86), ‘class 
society’ (205) and ‘system of oppression' (359). 

The third feature of the socialist present, changeability, seems to be the least 
unique to labour newspapers. Apart from ‘situation, words referring to the 
changing and changeable nature of the present are missing from Table 17.3. 
Some of them can be found in the key collocation list: for example, ‘reaction’ 
(70) and its counter-concept ‘progress / development’ (900). However, looking 
at the whole list of key collocates, it seems that this feature of the socialist pre- 
sent does not stand out in the context of Finnish newspaper discourse. Perhaps 
all the major political languages of the time—from liberalism, conservatism 
and socialism to Lutheran Christianity—believed that the world was changing, 
but they had different interpretations of what exactly was changing, how fast 
and, above all, if these changes happening in the present were leading to a bet- 
ter or worse society in the future. 

It can be intellectually satisfying to find confirmation of prior interpreta- 
tions, but nothing compares to finding something new. What is new is that the 
socialist protagonists of the present differ starkly from the bourgeois ones. 
The socialist version is based on the antagonism between good (‘worker’ ‘work- 
ing people; ‘working man, ‘proletariat / the poor; ‘crofter’) and evil (‘employer’) 
actors. This fundamental feature of the present was not found in the traditional 
collocation analysis, for the antagonism is so deeply rooted in the overall social- 
ist discourse that it does not specifically stand out in the context of the socialist 
present. This socialist tendency to construct political agency through a vigorous 
repetition of collective singulars, especially ‘the working people’ and ‘proletariat 
/ the poor; can only be invoked when comparing socialism to other political 
languages of the time, in this case with the help of the key collocation method. 
Correspondingly, the trade union jargon (union; organisation”) escapes the 
collocation analysis, but it is clearly visible in the list of key collocations. 

While collocation concentrates on the architecture of a concept in isolation, 
within only one discourse, key collocation can reveal the uniqueness or gen- 
erality of these historical conceptual architectures. In the case of the socialist 
present, the latter method seems to confirm most of the findings achieved in 
the collocation analysis. Nevertheless, we should also respect the fact that an 
opposite result was possible. For example, we know that a negative present is 
not a feature confined to the socialist discourse of the early 20th century (old 
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people complained about present children and manners already in the days of 
Plato*), but the point is that key collocation gives us a comparative empirical 
context, against which we can measure how much (for example, ‘negativity’) is 
much. Constructing comparative contexts through traditional close readings 
is a labour-intensive task. Perhaps this is one reason why historical temporali- 
ties have been analysed based on a rather limited amount of sources. 


From Sources to Wisdom: DIKW for Digital Historians 


In this final part before the concluding remarks and return to the sun of 
socialism, we rise from the empirical case study to a more abstract level, by pro- 
viding a theoretical account of our intellectual journey so far with the help of 
the so-called DIKW pyramid, a concept that has been influential in the infor- 
mation sciences, knowledge management and systems theory for decades." 
The pyramid describes the hierarchical relations between Data, Information, 
Knowledge and Wisdom. Although the pyramid has received criticism from 
several directions, we believe that a slightly revised version of the pyramid 
can be useful for explaining not only the analytical process of this chapter, but 
also, more broadly, the idea and promise of distant reading in the context of 
digital history. 

The vertical axis in the model represents what is usually described as 
‘connectedness. Connectedness increases as we climb up the ladder towards 
wisdom.” The idea is not completely unfamiliar to historians, but we 
traditionally prefer the word ‘context’ when describing the process of histori- 
cal analysis. In fact, the etymological root of context means weaving or join- 
ing together.“ Thus, we can replace ‘connectedness’ with ‘context’ in the model 
without a bad conscience. 

Then, we should add one layer below data, that is, historical sources.*! The cen- 
tral difference between sources and data to digital historians is that the latter is 
machine-readable, and currently only a small part of historical sources is avail- 
able for computational analysis as data. In the context of this chapter, physical 
historical newspapers are sources, whereas their digital representations—the 
PDF images and text files we downloaded from the National Library—belong 
to the category of data. 

What is information, then? Here, we differ from general definitions of 
information as data + context, or ‘data + meaning,” for the words ‘context’ 
and ‘meaning’ carry too much historical weight in the humanities. ‘Information 
as processed data is a more suitable definition for our purposes.? Examples 
of information would be simple word frequency time series (for example, 
Figure 17.1), word frequency tables (for example, Tables 17.1-17.3) or visuali- 
sations of words appearing close to one another (for example, Figure 17.2). In 
each of these examples, raw data has been computationally re-organised into 
low-level information. 
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Figure 17.3: The revised version of the DIKW pyramid. Source: Author. 


In our model, information does not include the historian’s interpretation 
of information, which is located one step higher in the pyramid. Knowledge 
is information that is interpreted and contextualised by a human scholar. In this 
chapter, knowledge refers to assigning meaning to individual tables and figures, 
and then connecting these meanings not only to one another, but also to previ- 
ous research—a difficult enterprise we undertook in the preceding pages. 

The top of the pyramid is called wisdom, and it is the most controversial layer, 
for it escapes a clear definition." Thus, it might be best to demolish it entirely. 
Russell Ackoff, often acknowledged as the founder of the pyramid, defined wis- 
dom as evaluated understanding.* A historian could perhaps imagine wisdom 
as an ability to see which parts of the specific knowledge she has produced 
is relevant in answering the most complex questions of history. If knowledge 
means deciphering the meaning ofthe socialist sun, wisdom requires that a his- 
torian understands the meaning of this meaning under the aspect of eternity. 
(We have not reached an understanding this deep in this chapter.) 

Now that we have reconstructed the pyramid, we can explain the point of 
distant reading in digital history. Distant reading aims at making the founda- 
tions of our historical explanations more solid, by piling up more stuff in the 
bottom ofthe pyramid. In other words, we want to increase the scale of histori- 
cal sources, and this is possible as long as our sources are in a machine-readable 
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form. A historian can hardly read 100,000 words per day, whereas a computer 
can ‘read’ more than billions of words a minute. Of course, such distant reading 
is not reading with an understanding, but rather finding connections between 
words on a more primitive level. 

However, we should not underestimate the fact that distant reading tech- 
niques, already in their current premature phase, can produce information that 
cannot be produced by human cognitive abilities alone. It is then up to a scholar 
to make sense of this elementary information. When a seasoned historian criti- 
cises your digital history research for ‘lacking context, he or she probably means 
that you have not reached the level of historical knowledge, in other words, 
you have not paid enough attention to connecting your preliminary findings 
to one another and to previous research. Data nor information interprets itself. 
Nowadays, it is already a cliché that distant and close reading methods are com- 
plementary. With the visual aid of the pyramid, we could rephrase this idea: in 
digital history, the goal is to shift our limited cognitive energy upwards.'* We 
distribute the most monotonous part of our historian’s craft to the machines, in 
order that they could find patterns and trends that need human explanations. 
Machines are fast, precise and tireless, in other words, good at refining data to 
information, but, at least by now, only humans are able to refine information 
to knowledge. 


Conclusions 


What have we learned from our distant readings of the present, from turning 
our macroscope towards the sun of socialism? First of all, as expected, the Gen- 
eral Strike of 1905 seems to be a pivotal moment, a measurable rupture, in the 
history of socialist temporality. The words referring to the present increase rap- 
idly after the strike. Metaphorically, the strike meant a mental earthquake that 
had long-term consequences. The strike reshaped the political environment so 
dramatically that old ideological maps lost much of their ability to explain con- 
temporary reality. In this new situation, the political language of Finnish social- 
ism turned out to be a temporary winner, gaining most votes in each of the 
post-strike elections. According to Hannu Soikkanen, socialism gave working 
people a coherent and solid world view which stood in sharp contrast to their 
unstable conditions." We could specify this argument from the perspective of 
temporality by adding that a new understanding of time formed one important 
part in the breakthrough of the socialist world view. In those turbulent times, 
the political language of Finnish socialism offered the most believable inter- 
pretation of the present for the working people. As demonstrated by our col- 
location analyses, labour newspapers convinced their readers that the present 
misery was caused by the system, and this system could and should be changed. 

Thus, based on our distant readings, we argue that the meaning of the social- 
ist sun in the early Finnish labour movement was not limited to the temporal 
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dimension of the future. The socialist sun affected the present too, by making it 
appear in a new, bad light. When one saw the red sun in the future, simultane- 
ously the shackles of capitalism came out in the broad daylight of the present. 
We also learned that socialists did not see the present system as eternal or divine, 
but as historical and man-made. In fact, we could contrast the sun of socialism 
with the biblical sun which was the same for everyone, or in Jesus’ words: “He 
(Father in heaven) causes his sun to rise on the evil and good, and sends rain 
on the righteous and on the unrighteous?* Our distant readings, especially the 
key collocation analysis, revealed that the socialist sun was shining exclusively 
for the ‘working people, ‘workers’ and ‘the proletariat, but not for ‘the capitalist 
employers: Thus, unlike the biblical sun that directed people's attention towards 
the hereafter, the red sun of socialism highlighted earthly problems. 

If we wanted an even deeper understanding of the socialist sun and tempo- 
rality, in other words, if we wanted to get closer to the top level of historical 
wisdom in the DIKW pyramid, these same analyses (comparable relative word 
frequencies, collocation, key collocation) should be performed for each dimen- 
sion of the time: the past, present and future. In addition, we could broaden 
our quite narrow focus from word frequencies to richer forms of linguistic 
information. For example, experimenting with verb tenses (the socialist use of 
past, present or conditional forms) sounds reasonable when solving questions 
related to historical perceptions of time. 

In the end, we could speculate for a moment: if the telescope and microscope 
changed the fields of astronomy and biology, will the macroscope, or compu- 
tational methods in general, change our historical research?” We believe the 
answer is positive in the long term, but we are not quite there yet. According 
to Max Weber, a new ‘science’ emerges where new problems are pursued by 
new methods,” but in this chapter we have mainly answered old questions 
by using novel tools developed outside the community of historians. While not 
revolutionising the field, these tools can help us to improve our craft, in our 
everlasting quest towards historical wisdom. 


Notes 


! "Tyówáestón olosuhteista Karjalassa; Työmies, 21 August 1903, p. 1, https:// 
digi.kansalliskirjasto.fi/sanomalehti/binding/728854?page=1. 

? The quote is from Korff 1993: 124. On political symbols in the context of 
socialist movements, see, e.g., Steinberg 2002: esp. 224-246; Hake 2017: esp. 
100-119. 

? See, e.g., Soikkanen 1961; Haapala 1986; Ehrnrooth 1992; Suodenjoki 2010; 
Rajavuori 2017. 

^ On Marxian interpretation of history, see, e.g., Soikkanen 1961: 30, 91-92, 
231-232; on socialist utopianism, see Ehrnrooth 1992: 169-177; on social- 
ist eschatology, see Huttunen 2010: 57-65. 
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* Kaihovaara 1986: 34-35; Kaihovaara 1991: 52. 

° Koselleck 2004: 263-267. 

7 Tikka 2017. 

* Alapuro 1988: 115-117; Eley 2002: 66, table 4.2. 

? Le Macrosope was the title of Joél de Rosnay’s classic introduction to sys- 
tems theory (1975). Katy Bórner reintroduced the concept in her often- 
cited article on plug-and-play macroscopes. See Bórner 2011. 

? Available at https://digi.kansalliskirjasto.fi/opendata?language=en. For 

more information on the historical newspaper collection, see Pääkkönen et 

al. 2016. 

The absolute numbers for the search string ‘nyky*’ in 1900-1917: 72,874 

hits in 96,354,158 total words in Tyómies; 121,473 times in 157,438,611 

total words in Uusi Suometar; and 96,923 times in 135,166,576 total words 

in Helsingin Sanomat / Päivälehti. All the calculations in the chapter were 
performed using AntConc (3.5.7.), available from http://www.laurencean- 
thony.net/software. 

Kujala 1995: 42-53. 

Kaatra 1906: 4. Translation by the author. 

14 This analysis is based on comparing the words appearing in close prox- 

imity to the present before (1.1.1904-30.10.1905) and after (7.11.1905- 

31.12.1907) the General Strike of 1905 in Tyómies. In practice, all the words 

appearing in a window of five words to the left or right of the search string 

‘nyky” were collected into one ‘post-revolutionary’ mini-corpus, which was 

then statistically compared with all the ‘pre-revolutionary’ neighbouring 

words of the same search string. The method is explained more carefully in 
the section entitled ‘Key Collocation: 

Anderson 2006. 

Kokko 2016: 24-25, 44-45, 413-425. 

Anderson 2006: 24-26, 33. 

18 On conceptual history, see, e.g., Koselleck 2004; on discourse analysis, 

Baker 2006; on the Cambridge school, Skinner 2002. 

Finnish text files were lemmatised with the LAS command-line tool. See 

Makela 2016. 

There are different collocation metrics. We have used mutual information 

(MI) which measures the probability of whether or not the relationship 

between the search word and its neighbouring word is likely to exist by mere 

chance. The higher the MI value, the stronger the link between two words. 

With the high cut-off point, many commercial adverts that contain words 

referring to the present ‘drown’ in the sheer mass of newspapers texts since 

not many of them circulate in each of the three chosen socialist newspapers. 

Thus, the ideological patterns of the socialist language use become more 

visible when the cut-off point is high. 

2 The visualisation was achieved with LibreOffice (5.1.6.2), available from 
https://www.libreoffice.org/. 
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? de Bolla 2013. 

*4 On tsarist repression in the socialist newspapers, see, e.g., ‘Suomen suhde 
Venäjään, Sosialisti, 5 December 1908, p. 2, https://digi.kansalliskirjasto. 
fi/sanomalehti/binding/717172?page=2; “Wenäjä ja me; Kansan Lehti, 
13 December 1909, p. 1, https://digi.kansalliskirjasto.fi/sanomalehti/bin 
ding/649887?page=1; “Wirkaryssät uusia eduskuntawaaleja hommaa 
massa, Työmies, 23 October 1912, p. 1, https://digi.kansalliskirjasto.fi 
/sanomalehti/binding/1187961?page=1. On domestic bourgeois oppression, 
see, e.g., Suhteemme hallitukseen; Kansan Lehti, 4 July 1907, p. 1, https:// 
digi.kansalliskirjasto.fi/sanomalehti/binding/645861?page=1; “Työväen suo 
jeluslait; Sosialisti, 6 April 1911, p. 2, https://digi.kansalliskirjasto.fi 
/sanomalehti/binding/1177250?page=2; 'Budjettikeskustelu eduskunnassa; 
Työmies, 23 April 1913, p. 2, https://digi.kansalliskirjasto.fi/sanomalehti/bin 
ding/1187515?page=2. 

% Soikkanen 1975: 120-197. 

2 "Euran Pohjoispäästä, Sosialisti, 27 April 1907, p. 3, https://digi.kansalliskir 
jasto.fi/sanomalehti/binding/702923?page=3. 

7 ‘Lansi-Teiskosta, Kansan Lehti, 12 January 1912, p. 3, https://digi.kansal 
liskirjasto.fi/sanomalehti/binding/1238591?page=3. 

"Miksi me vaadimme järjestelmän muutosta; Työmies, 23 December 1902, 
p. 2, https://digi.kansalliskirjasto.fi/sanomalehti/binding/739481?page-2. 

Soikkanen 1961: 29. 

%2 "Hiukan sosialismista, Kansan Lehti, 22 June 1911, p. 1, https://digi.kansal- 
liskirjasto.fi/sanomalehti/binding/1238279?page=1. 

?! ‘Kollektiiwiset työ- ja tariffisopimukset Englannissa ja Amerikassa; Työ- 
mies, 4 August 1909, p. 2, https://digi.kansalliskirjasto.fi/sanomalehti 
/binding/730033?page=2. 

? "Anarkia, pakkoluowutukset ja sosialidemokratit; Sosialisti, 28 September 1907, 
p. 2, https://digi.kansalliskirjasto.fi/sanomalehti/binding/703031?page=2. 

3 'Sosialistin — kirjapaino-osuuskunnan kirjapainon — pakkokappaleasia; 
Sosialisti,25November 1913,p.2,https://digi.kansalliskirjasto.fi/sanomalehti 
/binding/1178786?page=2. 

% Baker 2006: 125-128. There are different keyness metrics. We have used 
the log-likelihood (four-term) test in order to measure statistical 
significance. In addition, we have measured effect size using ratio of relative 
freguencies. 

35 Ouintelier 2007: 165. 

* For example, the only monograph focusing on Finnish temporality in the 
late 19th century is based on a correspondence of one educated family in 
eastern Finland. See Ollila 2000. 

? Rowley 2007. 

38 See, e.g., Tuomi 1999; Fricke 2009; Jennex 2009. 

? See, e.g., Bellinger, Castro & Mills 2003. 

^ Hyrkkänen 2009: 260. 
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“| Perhaps we should add even one more layer to the very bottom of the pyra- 
mid, that is, life itself. Peoples rich lives leave only fragmentary sources for 
historians (and computers) to analyse. 

? See, e.g., Worboys & Duckham 2004: 5; Floridi 2010: 20. 

5 Ackoff 1989. 

4 Hoppe et al. 2011. 

^ Bellinger, Castro & Mills 2003. 

46 This idea is taken from a very different academic field: plasma physics. See 
Carpenter & Cannady 2004: 4-6. 

” Soikkanen 1961: 30. 

‘8 Matt. 5:45. On the many meanings of the sun in the Bible, see Patterson 2011. 

® Graham, Milligan & Weingart 2015: 1-2. 

5 Franco Moretti made Weber' quote famous for digital humanists in his 
pioneering article on distant reading; see Moretti 2000. 
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PART IV 


Conclusions 


CHAPTER 18 


The Common Landscape of 
Digital History 


Universal Methods, Global Borderlands, 
Longue-Durée History, and Critical Thinking 
about Approaches and Institutions 


Jo Guldi 


In old-fashioned social history, the study of the common landscape’ used to 
serve an index of social and cultural difference.! Take two regions, two neigh- 
bourhoods or two houses, side by side: their differences illuminate cultural 
ideas about hierarchy, the reality of divergent incomes, and separate relation- 
ships with the material world. Just so, the current volume invites us to conduct 
a survey of the ‘common landscape of digital history in all its variation. 

‘Field’ though it might be in name, the domain of history as practised by 
scholars of different methodological and political orientations, geographi- 
cal and temporal subjects of study, and institutions around the globe is really 
more of a patchwork of different fields and sub-fields, connected by an 
infrastructure of main-travelled roads and divergent footpaths that only 
precariously serve the whole. Some of these fields are closely guarded by an 
embattled elite, others plowed by an army of workers, still others remote prov- 
inces known only to a handful of toilers. Here and there, social historians and 
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business historians have already harvested crops for generations. The bumper 
harvest of the future promised by new technology frightens some with its scale 
and lack of care, a question addressed by at least two chapters in this volume. 
In other places, however, tinkerers deploy the new algorithms to cottage-sized 
gardens of their own liking. Whether they garden with medieval heretics or 
parliamentary discourse, these cottage-gardeners are increasingly dabbling in 
some new technology—be it topic models, vectors or other tools found in this 
volume. Their work is just the garden. The landscape itself is changing as a 
result of these multiple efforts: it is a changing ecosystem, rebalancing in reac- 
tion to human labour, sometimes enclosed or guarded. 

The landscape metaphor, to be sure, obscures strong forces of historical 
change. In Chapter 4 above, Mats Fridlund offers an oceanic image for history 
and its trends. Forces whence we know not where transform the discipline, 
moving everything with them. For Fridlund, technology is the wave: in the 
modern era, technology is the reigning factor. Even scholars who eschew quan- 
titative methods depend upon the word processor, JSTOR articles and news- 
paper indexes that self-proclaimed non-digital historians depend upon. Using 
the analytical insights of the history of technology, Fridlund deconstructs the 
digital, and argues that, like nature, technology is always already with us. 

Indeed, there is something elemental in the transformation of scholarship in 
the modern era. New work with algorithms potentially participates in such a 
tide, and what is thrilling about it is the sense that any scholar, anywhere in the 
world, might contribute to its movement. Approaches to the study of history 
developed by one cadre of researchers working on 19th-century Finland may 
rapidly translate to studies of 19th-century Britain, of the 20th-century United 
States, or of medieval China. 

Yet, the diversity of different specialisations, periods and interests persists, 
despite the waves and winds that blow through from time to time. As the rich 
and diverse studies of this volume have demonstrated, digital history is not 
so much a field or sub-field in this rich and varied landscape, as a universal 
approach to history. The practices and methods of digital history are transform- 
ing plots here and there across the entire landscape: here a neural net, there a 
topic model, elsewhere a map or a social network, spanning the entire range of 
periods, geographies, orientations and institutions, such that no field of histori- 
cal scholarship, however remote, is today too far away from some garden bed 
where some scholar has applied a computer-aided practice to their labours. 

For proof of this intermingling diversity, one has only to look at the rich set 
of different algorithms and questions in this volume. Each chapter contributes 
a different orientation and algorithm, such that the volume as a whole surveys a 
wide variety of different times, places and orientations. Nonetheless, the neural 
nets used to study medieval heresy have been used elsewhere to analyse the 
history of photography, and the topic models used here to analyse German 
humanist discourse have been used elsewhere to study the history of British 
parliamentary debates about infrastructure and 20th-century American news- 
paper coverage of cities. 
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Surveying digital history thus implies looking over the whole of the rich land- 
scape of spaces that constitute traditional history. All of its periods, geographies 
and intellectual orientations are being reworked according to collective invest- 
ments, new approaches, and the questions and problematics opened up thereby. 

But is a common set of approaches, shared between knowledge-workers, 
necessarily the same as a hegemonic project of empire? What is the difference 
between a ‘universalising’ set of common practices and a ‘universal’ history, 
a single narrative that is supposed to provide a final answer for all times and 
places? Will the oceanic wave of digital history drown out the lovingly culti- 
vated diversity of the cottage farmer? 

In this chapter, I will set out to answer these questions, exploring how the 
methods of the digital historians exemplified in this volume overlap with 
the universally applicable discourses of critical theory, how far the two are 
exclusive and where they are now collaborating together to forge a new set of 
critical approaches to the study of time. I will use the examples in this volume 
to test and refine earlier assertions about the trajectory of digital history with 
regard to the longue durée, microhistory, critical theory and politics. I will con- 
clude that the imprint of the longue durée is clear; the political implications of 
practising digital history less so. Along the way towards answering those issues, 
this chapter will remark on several major features of the common landscape of 
history as it is now evolving. 

This chapter will therefore raise questions about the institutional, national 
and geographic alignments that make ‘doing digital history’ possible, reflecting 
further on some of the themes of institutional investment covered in this book. 
It will raise questions about the future institutional geography of scholarship, 
drawing on the implications raised by this volume—one of the most methodo- 
logically and forward-thinking volumes on digital history at present—coming 
from a European nation whose sometime borderland status flags important 
changes in the geography of scholarship under the digital turn. 

Finally, this chapter will look ahead to new trends in scholarship visible 
in the contributions from this volume, notably: a rise of methodological articles 
that model the ‘bridge’ between close and distant readings, the rising impor- 
tance of scholarship that targets the ‘fit between questions and algorithms, 
and the increasing importance of international institutional collaboration. As 
the chapters in this volume demonstrate, digital history is well on its way to 
establishing theories and methodologies that satisfy these most critical criteria 
of scholarly investigation; in the years that follow, scholars who pay attention to 
these themes will have even more to look forward to. 


The Universalism or Common Space of the Digital Humanities 
The discussion of methods herein, one might say, offers a truly ‘universal’ ten- 


dency for scholarly exchange, and it is worthwhile pausing to understand what 
we mean by that. In another volume, the reader who specialises in medieval 
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history might flip only to the chapter written by the medievalist. But it is a tru- 
ism that in digital history, and the digital humanities more generally, to pass 
over the other sections would be an error. The methods developed for one sub- 
field will be relevant to the next sub-field tomorrow if they aren't already today. 

We see in these critical reflections the shape of new standards for historical 
work as scholars puzzle over the fit between particular algorithms and ques- 
tions. In Chapter 16, Välimäki and his co-authors (Reima Välimäki, Aleksi 
Vesanto, Anni Hella, Adam Poznański and Filip Ginter) used neural nets to 
confirm the authorship of the Refutatio Errorum, an anti-heretical treatise 
from Germany in the 1390s, establishing, in the process, a new standard for 
author detection. In Chapter 15, Heidi Hakkarainen and Zuhair Iftikhar prove 
that topic models are well-suited to engaging Koselleck's idea of concept his- 
tory, linking concepts with temporality. The chapter describes a vast culture of 
experimentation and discovery, as scholars try out competing algorithms, test- 
ing the fit of each method to the scholarly questions independently identified 
as problematic in each field. 

To dub such a capacity for common meeting ‘universalism is to underscore 
that anyone might play with any of the ideas at stake, even while an enormous 
pluralism existed of period, subject and political orientation. It would certainly 
not imply the universal applicability of any one fact or conclusion reached by 
the research, which is subject, as all historical research is, to the revisions of 
new discoveries, new archives and new approaches. Perhaps an even better 
term than ‘universalism’ for what we are investigating would be the ‘common 
space’ of the city that Hannah Arendt identified as a metaphor for the best 
strivings of both Enlightenment and democracy: that they could be accessed by 
anyone, that they enlivened and illuminated all lives that touched them.” 

By employing the image of common space' in the city as a metaphor for a 
certain aspiration in discursive activity, Arendt underscored the question of 
access: since the European Enlightenment, the modern city was defined by 
spaces of equal access, spaces that didn't shut people out. We might say the 
same about digital history: whatever the critique from the outside, the prac- 
titioners of digital history have taken pains not to shut out any practitioners 
whomsoever, and many of them have worked at length to convert digital tools 
into material for critically inspecting empire, gender and race? 

What about the contention that digital history is itself imperial and univer- 
salising in nature, threatening to draw all history practitioners into a single 
method, problem of study and macrohistorical overview of the longue durée? A 
notorious example of universalising claims by biologist interlopers relates to a 
decade ago, and the haunting claim, made in the national media of the United 
States, that historians would be obviated by the coming of the computer.’ As 
digital history is actually practised, we see very little of this. More relevant is 
a portrait of individual scholars or scholars in small units working together to 
execute some new perspectival opening onto their sub-field—medieval her- 
esy, the Finnish Parliament, the Finnish borderlands or European humanists. 
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Not one chapter in this volume gestures towards a universalising macrohis- 
tory of the longue durée that would eliminate the perspective of workers or 
feminists beneath the triumph of the nation or empire. Not one chapter in 
the whole book even claims to dispense with or obviate the field of social his- 
tory, biography, intellectual, cultural or political history, although most chap- 
ters build on aspects of other forms of history in some respect. To paint the 
common landscape of scholarship, the artist would truly have to recognise a 
thousand digital histories, not one digital history, stretching along the plain, 
informed by exchanges, building on works already in progress across the land. 

Another version of the complaint against digital history as universalising and 
therefore coercive borrows an image from Lawrence Stone, who in the 1970s 
warned against quantitative enterprises of all kinds as commandeering gradu- 
ate students into massive, pyramidal projects where their intellectual inquiry is 
dictated from above and individual initiative is squashed.? This, too, seems not 
to be the case. In the quarters of digital humanities where graduate students 
are enlisted, they are often featured as first-authors on projects that they were 
invited to craft with the skill and support of leaders in the field.* 

Graduate students and early-career faculty in this volume, including Reetta 
Sippola and Matti La Mela, are particularly distinguished as early adopters and 
explorers of new technology, who have been willing to extend their training 
in some sub-fields to adopt new technologies, test a theory or explore a time 
period. The borrowing—from computer science, statistics, the digital humani- 
ties, linguistics or library science— means that these early adopters are fast at 
work in building up the discursive commons that Arendt so praised. 

Whatever we call it, exchange between sub-fields has quite a bit going for 
it. Discursive commons, technological borrowing and other such scholarly 
programmes of exchange represent an instinct for common space or the uni- 
versal that, according to science, has been waning of late in the academy. This 
tendency to narrow the study of the universal back to readings in ones sub- 
field, we learn, is not merely a habit of the humanities, but can be quantitatively 
identified in the social sciences and sciences as well. According to the research 
of sociologists of knowledge such as James Evans, scholars today generally 
cite fewer readings directly outside their realm of concern than did scholars 
a generation ago. It seems to be the case that internet-enabled web catalogues 
have restricted universal reading habits of borrowing from nearby disciplines, 
whether for critical theory or for other inspiration. The sheer overwhelming 
scale of available knowledge left scholars paralysed by the task of keeping up 
with their nearest cohort. 

History seems to indicate that information economies more generally are 
marked by a pulse of broadening the process of collecting information and 
refining the information thereby collected. This pulse of information analysis 
has been studied across decades-long exchanges about early-modern botanical 
knowledge, as well as on the personal scale of Darwin's notebooks." What we 
are calling ‘universal! moments or moments of common spaces’ seem to be 


332 Digital Histories 


moments of expansion, when scholars looked to particular discussions as 
relevant for scholars who worked across a broad variety of periods and places. 

Today, the digital humanities partake of a similar moment of universalism, 
in which scholars of 1980s online culture, 19th-century novels and Chinese 
medical texts regularly meet and compare notes, finding algorithms to borrow 
from one another. As a result, a discussion of methods offers a meeting ground 
on a broad scale, as well as an opportunity to compare notes across different 
sub-fields and disciplinary orientations. 

There have been other moments of expansion on a theoretical level, where 
humanities disciplines are reforged through insights from without. Such, 
for instance, was the impact of the steady importation of continental phi- 
losophy and critical theory into the humanities since the 1970s. A Freudian 
reading of Augustine could be interesting to those contemplating the nature 
of the biographical subject during the American Revolution precisely because 
the method could be so easily transported and applied to other fields and sub- 
jects. By the same token, a Foucauldian reading of colonial India might, in 
theory, interest readers from the social history of industrialisation. 

Indeed, it is possible that methodological moments of unification offer a nec- 
essary antidote to the paradigm of modern specialisation of knowledge, with 
its tendencies to mince fields into sub-fields and further sub-fields, with the 
concomitant risk of knowing more and more about less and less. The arrival 
of critical theory in the 1970s meant that the scholar of Virginia Woolf and of 
Classical Athens could find both a common meeting ground and a common 
language in terms of inclusiveness, femininity, the knowledge of the state and 
the construction of the individual. 

At the same time, however, the digital humanities are beginning to see a 
moment of critical inward inspection, of the refinement of processes and 
pipelines. The ‘universal’ impulse in the digital humanities is thus giving way 
to another phase of information analysis—one predicated upon the close 
inspection and comparison of algorithms, the attention to metadata, and the 
examination of OCR errors and named-entity detection. As Kimmo Elo points 
out, this domain of attention is a necessary part of the process of refinement 
if the garden of earthly methodologies is to bear fruit. The labour of refine- 
ment and inspection will almost certainly be a domain of work that requires 
the labour of historians. 

If building interoperable tools and applying them to great questions of his- 
tory represents the universal access of the city square—liberating with its 
sense of wide access and possible exchange—the work of refining metadata 
and inspecting tools is more like the garden plot of digital history, a place that 
requires focused attention and hard work to produce useful results. The meta- 
phor is complete if we imagine that the garden of tool- and data-refinement 
produces useful results that can be taken back to the universal exchange of the 
city square. As Johan Jarlbrink points out, inspecting the results of metadata 
analysis through simple methods such as the ‘tally’ stands to help us unpack the 
‘black box’ of digital learning. 
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A volume of the present kind offers a rare glimpse into how the entire breadth 
of history is shaped—in its temporal range, from studies in the history of Finn- 
ish feminism to medieval heretics to labour politics, and in its methodologi- 
cal range, from the prosecution of new frontiers with existing tools like topic 
modelling to the close attention to algorithms, metadata and tallies of OCR 
errors. The scholar who represents each period, place, theme and method has 
a separate body of texts and different methods, to be sure; indeed, the scholar- 
ship in question was in part selected so as to adequately represent the diversity 
of possible methods, approaches and statistical rigour practised across history 
as a whole. 

In offering a meeting ground of methods and periods, a methodological vol- 
ume offers an important service to the discipline as a whole. The scholar of 
medieval heresy may wind up later borrowing the tools from the scholar 
of humanism or labour politics, even if their data for the moment looks entirely 
different. Digital practices tend to lend themselves across formats, political 
interests, historiographical orientations, periods or geographies. Thus, digi- 
tal tools draw historians back to a certain methodological universalism, even 
in the face of other kinds of plurality, insofar as they encourage practices of 
reading-across-boundaries, in the form of conferences or volumes, like this 
one, that reward the practice of rich learning in new directions. 


The Pedagogical Role of Discourse 


In order to enjoy knowledge as a commons, one mandate is that the experts 
of the commons must be motivated with an eagerness to explain, to render 
accessible the more difficult concepts that they have assembled for the use of 
others. In the era of critical theory, difficult writers from Marx to Heidegger got 
interpreters who translated their concepts into ready-to-wear essays: ‘Benjamin 
for Historians, the beginner's guide.* Digital humanities can only claim to be 
a ‘commons’ accessible to all sub-fields of history insofar as it too has been 
equipped with multiple translation projects, rendering difficult statistics and 
algorithms within the grasp of the total novice. 

The present volume is a monument to the pedagogical impulse of digital 
history. The writers assembled here have taken pains to draw down the abstrac- 
tions of algorithms, statistics and databases into language of period, inquiry 
and method familiar (or at least tractable) for traditional historians. Each 
chapter introduces its method, algorithm and dataset in careful detail, presum- 
ing little prior acquaintance with historical method. Writers test and explore 
the possibility of ‘false positives’ raised by misinterpreting topic models. They 
ask about the bridge between macroscopic ‘overviews of the material and 
‘close reading, and how an overview can or should guide the reader back to 
individual texts or episodes. They seek to open up the ‘black box’ of digital 
analysis, to unpack and critique its workings, and to thus devise a new 
machine for critical analysis of the past. The result is a series of essays that 
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are pedagogically precise: an instructive model about how to describe a new 
process for the use of other scholars. These are the tactics that historians of the 
future should recognise as a new standard for how openly, precisely and clearly 
we write—for what a truly ‘universal’ or common practice of historical reason- 
ing looks like. 

The authors of the individual chapters of this volume have taken the editors' 
direction for clarity, instruction and critical thinking to heart. In Chapter 9 
on feminist history in Finland, the question under investigation is the correct 
approach to take, and Heidi Kurvinen playfully offers her own experiences of 
success and failure as an example for other scholars venturing onto this unpre- 
dictable frontier of knowledge, where the rules of adequate preparation are not 
entirely spelled out as yet. In Matti La Melas Chapter 11 on the freedom to 
roam in Finnish parliamentary debates, he walks the reader through a step- 
by-step recreation of the process, suitable to educating historians with no prior 
use of the algorithms or techniques at stake. In Chapter 16 by Välimäki and 
colleagues, the authors critically explore the use of each part of their data (the 
examples of anti-heretical texts with known authors used to train the neural 
net, the authors proposed by other scholars) and each algorithm (the Support 
Vector Machines used to detect authorship, the vector machine used to clean 
the data). Computational author detection, they urge, should become the new 
currency of the discipline, amplifying other criteria of authorship detection 
such as structure, manuscript tradition and argumentation. 

The writers in this volume have also shouldered the burden of offering their 
own interpretation of the new approaches necessary to digital history. From 
the ‘tensor history’ of Timo Honkelas foreword to the ‘resource criticism of 
Mats Fridlund's Chapter 4, their inquiries point to the importance, as history 
adopts new methodologies, of adequately spelling out the limits of an inquiry, 
the sources of the data, the silences and limits of each inquiry and the possibili- 
ties opened by particular algorithms. 

I read the theoretical trajectory of this volume as a powerful demand that 
each historical encounter explicitly renders obvious its limits, both in terms 
of sources and in terms of methods. And if that seems like something that 
historians have already done in carefully describing their paper archives and 
theoretical baggage, consider this: what if every historical study that leans, in 
part or in whole, on secondary sources, newspapers, parliamentary debates or 
other digitised corpora considered not only the limits of the micro-archive, but 
also the limits of the macro-archive? We would have to choose forms of analysis 
that allow us to form an overview of the archive as a whole, for the purposes of 
both longue-durée analysis and acknowledging the historical situatedness and 
inherent bias of each archive. In essence, we are being invited into a new age 
of historical criticism, one that brings the ‘capital/periphery’ critique of post- 
colonial studies home in the sense of recognising the limits of data and critique, 
acknowledging the way in which our source-base and view of history has been 
shaped by power and limited all along. 
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The Universal Borderlands of the Digital Humanities 


One token of the universalism of the digital humanities is that the contribu- 
tions of a group of historians stationed mainly in Finland (rather than Britain 
or the United States) could dare to claim for itself so vast a title as Digital Histo- 
ries, as if aiming to define the new field. Nor is the volume bound by an exclu- 
sive orientation to Finnish history: the subjects of this volume range across 
Europe. Perhaps because of the ‘universal’ tendency of digital humanities 
research outlined above, digital humanities research in Finland is, indeed, as 
diverse as the historical discipline in Finland. The present volume represents 
an earnest attempt to define the historiographical range of questions presented 
by historians working with digital techniques, to cover the range of algorithmic 
and metadata practices used by our colleagues, and to introduce new scholars 
to best practices and techniques that merit attention from scholars of any time 
period or geography. 

The digital humanities provide an arena of study that is ‘universal’ in that it 
incorporates broad engagement with international currents in intellectual and 
cultural history, feminist history and the history of science. The topics herein 
contained range from medieval studies of authorship among Waldensian heret- 
ical texts to the popular reception of scientific astronomy in the 18th century. 
Far from being rigid and inflexible, the chapters here show off an astonishing 
variety of methods, each the mirror of a historical problem with its own histo- 
riographical legacy stretching back over decades. The research projects in ques- 
tion demonstrate something of the expansiveness of interest and time period 
that could be found within most national traditions. 

International sharing of methods and data has been intense over the last dec- 
ade, among historical practitioners, and this international intensity has raised 
a number of new international capitals for digital humanities research—among 
them digital humanities centres and nodes of excellence in Umea, Uppsala, 
Venice, the Max Planck Institute Berlin, the Dutch national research infra- 
structure CLARIAH, the University of Sussex, the Language Bank of Finland, 
the University of Helsinki and the University of Turku. On a global level, digi- 
tal humanities researchers in Singapore, Taiwan, China and Latin America 
are producing significant demonstrations of new methods. The map of digital 
history practitioners on the avant-garde of methodology is both more inter- 
national and marked by the presence of younger universities and research 
institutes than a more traditional map of excellence in the humanities. With 
all respect to the digital humanities summer institutes that introduce many 
scholars to the techniques of DH, this is simply not a field of which one gains 
mastery by a single visit to a great master at Oxford, Cambridge or Harvard: the 
field is moving too quickly, with nodes of excellence developing in seemingly 
improbable places. 

The conditions for international participation in digital history are set by tra- 
jectories that have something to do with the existence of national traditions 
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of history in all of those places, as well as the rise of an international infor- 
mation economy over the lifetime of the authors of the present volume. As 
Paju’s Chapter 2 explains, historical researchers in Finland have experimented 
with digital techniques since the 1960s, much as American researchers trace 
their roots back to experimentations with punch cards and Latin codices in the 
1950s. Like universities around the world, Finland has a national tradition of 
historical research with centres of excellence of teaching and learning. Finnish 
scholars, building on generations of institutional development, have had the 
opportunity to theorise important questions about statistical measures and AI, 
close reading and distant reading, and the role of learned societies in building 
and maintaining the intellectual infrastructures of today. 

As such, the volume represents an event horizon within the global practice 
of historical scholarship which marks the rise of institutional 'borderlands that 
are less well-established than the Ivy League or the ancient universities of West- 
ern Europe. Finland's distinctiveness within the digital humanities thus offers a 
paradigmatic path for other national bodies of researchers who wish to vie for 
distinction on the frontiers of interdisciplinary knowledge-making. Like many 
other economies in the developed world, and many in the developing world as 
well, Finland is heir to the information economy with all of its perquisites. Like 
many privileged departments in Europe, Canada, North America and Aus- 
tralia, Finland's history departments, libraries and language banks have ben- 
efited from an aggressive institutional programme of digitisation and support. 
These three preconditions—a tradition of historical study, participation in the 
international information economy and institutional development funded on a 
national or international level —make it possible for scholars and universities to 
mark themselves out for distinction within the space of digital history research. 

The ‘universal’ power of the digital humanities has thus established an arena 
where newer institutions and national traditions of historical research can 
play, on equal terms, with the oldest and most distinguished universities in the 
world. The present volume demonstrates how scholars from a European bor- 
derland have harnessed this power to demonstrate their engagement with new 
methods, tools and critiques. 


Digital Directions: The Longue Durée, Identity, etc. 


Critical readers will want to know whether the digital histories of this volume 
are closing out or displacing other kinds of inquiry. By implication, they won- 
der whether departments that choose to invest in digital research are necessar- 
ily thereby foreclosing on other kinds of research strategy. The rationale behind 
fears such as these are located in the real, historical experience of intellectual 
‘turns’ in the academy: one meeting ground sometimes displaces another, and 
this was true of critical theory. Topics of study inherited by the 20th-century 
humanities from the 19th century and classical precedents included the 
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existence of ‘genres’ in fiction, the search for ideal character in the genre of 
biography, the ideal of succeeding generations of ‘reforny in the study of politi- 
cal regimes and the history of progress through a cascading series of perfec- 
tions (whether in the form of intellectual history, the history of science or the 
history of technology). 

As scholars came to understand the past with the aid of critical theory, old 
categories of research (genre, character, reform and progress) each held within 
them a doctrine that was itself a historical construct. What critical theory did 
for the liberal conscience of the university was to create a series of substitutions 
where a former doctrine was broken open in favour of a series of new research 
questions. Many of those questions were informed by politics (for example, 
jettisoning the doctrine of empire’s beneficence in favour of a series of critical 
questions). Rather than taking the 19th-century agenda of matters for histori- 
cal investigation as a given, it was possible to subject each of them to decon- 
struction and historical analysis. In the process, a new set of research questions 
emerged: instead of character, there was agency, and with it the question of 
under what conditions it became possible for an individual or a certain group 
to exert change over their fate or the course of collective experience. Instead of 
technology as the progress of inventions from one generation to the next, the 
history of science and technology was reborn as a series of questions about 
the ideology of science and technology; their affiliation with empire, masculin- 
ity and capital; the institutions that support them; and the illusion of forward 
progress. Critical theory buried the naive liberal scholarship that came before 
it, replacing it with a series of new research questions. 

Just as critical theory pushed out the set of uncritical liberal targets of research 
that came before it, so, it might be expected, will the new goals of digital 
history displace some of the focus of the scholarly record before them. It is too 
soon to tell what the subjects of replacement will be; the relationship between 
digital history and earlier generations of history is still in formation. Moreover, 
the number of digital history papers that directly counters an extant histori- 
cal theory is very small, in comparison with those in digital literature, where 
scholars such as Ted Underwood and Andrew Piper have explicitly taken 
on some of the mainstream conclusions of the field and shown how digitally 
produced knowledge overturns received wisdom about, say, the idiosyncrasy 
of Flaubert. 

While digital history remains, as yet, immature, the field as a whole is guided 
by theories of when, how and whether digital history will call for a revision of 
lasting tropes in the discipline. In The history manifesto, my co-author and I 
pushed the strongest possible case for a revolution in critical thinking abetted 
by access to digital tools, and we sought to describe what that might look like: in 
brief, we conjectured that longue-durée timescales of 100 years or more would 
displace microhistory on the scale of the human life or shorter. We advanced 
some related claims about the political crises of the present and the new longue- 
durée inquiries that they might provoke (shifting attention to climate change 
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and economic inequality over identity politics). How have those prophesies 
held up when it comes to the writing of digital history' five years later? 

One of our guesses was the renewed importance of longue-durée perspec- 
tives, given the fact that digital research made trivial the repetition of cultural- 
studies-type analytics on long time scales using scales of material that would be 
impossible for the traditional reader. Just such an approach is represented, in 
the current volume, by several uses of newspaper corpora. In Chapter 15, Heidi 
Hakkarainen and Zuhair Iftikhar prove that, over 21 years, the language of Aus- 
trian newspapers demonstrates the growing importance of historicist thought. 
They chart the succession of disciplines through which ideas of humanism 
spread, a modelling of ideas like a contagion, a transition from education to 
the reformation to revolution and later the death penalty. They reveal how 
newspapers gradually connected the human, the spiritual, democracy and the 
future, a new language evidenced by the increasing importance of the terms 
Zukunft (future) and Zeit (time) to the definition of humanism. In Chapter 11, 
Matti La Mela uses the Finnish parliamentary debates to investigate the free- 
dom to roam over a century. On an even longer time span, in Chapter 12, Pasi 
Ihalainen (with the help of Aleksi Sahala) identifies several kinds of discourse 
about internationalism since the founding of the League of Nations, including 
internationalism typically opposed to (but sometimes arising from) national- 
ism; variants of party and labour internationalism; the spirit of nationalism 
linked to ideas and values; and the promise of peace and democracy arising 
from internationalism, especially after 1930. A revised chronology of interna- 
tionalism shows a peak in the 1970s, followed by a period of frustration. 

New findings that draw from long timescales can border on the breathtaking. 
Perhaps the most surprising finding in Ihalainen and Sahala’s investigation of 
long-term discourses of internationalism was the longevity of ideas. Ihalainen 
found that down to the ‘leave’ side of Brexit, many of the ways of describing the 
promise and threat of internationalism remained unchanged since discussions 
of the League of Nations in the interwar period, on both the left and the right. 

Much of the work that was once carried out by the ‘close reading’ of the 
cultural turn in the 1980s and 1990s can now be achieved with greater precision 
and efficacy by algorithms designed to discern, and to measure, similarities and 
differences of expression and sentiment, allowing the tight comparison of dec- 
ades, institutions, political parties and individuals. The work in this volume aptly 
demonstrates that the work of the cultural and linguistic turns—concerned with 
the shift of lexicons and the insight this provides about historical identities and 
communities—are now, on their cutting edge, digital in method. 

There are real challenges facing the prosecution of the longue durée as 
well. In Chapter 3, Jari Eloranta, Pasi Nevalainen and Jari Ojala point to the 
overwhelming scale of the archives that still await digitisation in their domain, 
that of the modern business history of Finland. They cite the archives of 
Finland's 20th-century government administration since the 1970s, now held in 
the National Archives: ‘roughly 200 shelf-kilometres of documents essential to 
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understanding the history of the welfare state and neoliberalism. These archival 
materials would have to be scanned first before they were analysed by digital 
means, and they are not scheduled for digitisation. A more practical approach, 
for the moment, is the kind of sampling recommended by Claire Lemercier 
and Claire Zalc in their Quantitative methods in the humanities? The facts of 
this dynamic raise challenges to archival institutions, grant-writing bodies 
in the humanities and the institutions of democracy themselves. A survey could 
be conducted by sampling documents out of that 200 kilometres of shelves, 
but what would be left out? Surely citizens of each region in Finland deserve 
the tools to monitor the history of how changes in government organisation 
affected their own landscape. Serving citizens requires an infrastructure and the 
building of tools. If such an infrastructure needs to be built for other purposes, 
and the archives need to be digitised, then historians can rise to the challenge 
of asking questions about the longue durée of that 200 kilometres of shelves and 
its analysis. Indeed, it is possible, as Ted Underwood has lately argued, that trends 
otherwise invisible on the short durée would emerge from such an analysis." 

Some ofthe guesses hazarded by The history manifesto went against the actual 
course of scholarship, in particular the continuing importance of research into 
gender, race and class. Prognostications that identity politics would be displaced 
by a larger, shared concern over economic inequality proved short-sighted in 
the face of the rise of right-wing movements around the world and the renewed 
relevance of identity-based activism. To counter those movements, scholars, 
journalists, lawyers, artists and ordinary citizens have returned to the longue 
durée of injustice in a powerful way; for instance, through the demands for 
reparations for slavery in Britain and America, or in the public controversy 
over monuments to confederate generals across the American South. Indeed, 
until racism, sexism and nationalism are abolished, historians are bound to ask 
questions about where they came from. 

Thus, what we see in the new practice of digital history is not so much the 
displacement of critical theory by digital history, as the integration of the ques- 
tions posed by critical theory on longer time spans, addressed with methods 
that allow the historian to fully integrate the methods ofthe cultural, social and 
linguistic turns. The natural outgrowth of these dynamics is a kind of digital 
history that fixes on identity and empire as its subject, exemplified by the many 
projects gathered and reviewed in Roopika Risams Postcolonial digital humani- 
ties (2019). The present volume gathers studies in the history of Karelian bor- 
derlands and Finnish feminism. It is unsurprising that any cultural or political 
event could be traced at scale and in depth by digital means. 


Implications and Future Directions 


What are the implications of such a study for other scholars, if not to mark 
out a definitive set of algorithms or revisions for others? I would argue that 
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the output of such a volume as this is significant in that it sketches out the 
computational best practices of a moment. It also speaks frankly to some of 
the challenges ahead. It is to the work of coming challenges that I will write 
for the remainder of this conclusion. 


The need for theorising the bridge between distant and close readings 


The possibility of informing close reading with the power of synthetic tools is 
one of the major promises of digital history. Of his century-long examination 
of collocation in parliamentary discourse, Ihalainen explains, ‘distant reading 
reveals peculiar political points that might have gone unnoticed in mere close 
reading or full-text keyword searches of the same documents. 

While a few of the chapters in this book begin with distant reading and end 
by examining particular documents, there are few examples today of historical 
or literary practice that moves from the distant overview down to the level of 
authors or other categories and into particular passages in the book, critically 
examining the results of a search based on the close reading of the page. Like 
much of the work in the contemporary digital humanities, results of distant 
reading are frequently given in a summary chart or single finding. Two notable 
exceptions offer a meditation on the ‘bridge’ between the distant and the close, 
and provide historians with a way forward. In a rich meditation on the history 
of close reading among historians of women, in Chapter 9, Heidi Kurvinen 
draws a contrast between the tools she studied in her training and the topic 
modelling she applied to a study of Finnish suffragettes; in the process, she 
offers her reflections on using the topic model as an index of different episodes, 
and compares some of the findings of close and distant reckoning. Similarly, 
in Chapter 7, Johan Jarlbrink descends from a project that surveys the effects 
of media on cities to critically examine the sample of cities recognised by the 
computer. Bridging close and distant reading becomes, for Jarlbrink, an oppor- 
tunity to recognise problems in the measurements supplied by algorithmic 
tools on ‘dirty’ data. As we can see from the two examples above, the practice 
of distant and close reading is evolving, and new hybrids are being forged that 
unlock insights in the archives and highlight shortcomings in the technology. 

Future approaches to the bridge between close and distant reading may do 
well to follow the pattern set by Andrew Piper in his recent survey of modern 
literature, Enumerations, which proceeds from distant readings of the themes 
and trends across poetry and the novel, down to particular authors, poems and 
passages, as guided by the tools of distant reading." After all, the same tools 
that draw our attention to words can be used to compare individual speakers as 
well as parties, and indeed to draw attention to the particular paragraphs and 
sentences that the computer discerns to be the most exemplary cases of a par- 
ticular concept and collocation pattern. That is, where a scholar learns that an 
important collocate of ‘internationalism is ‘nation, it would be useful to learn 
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next which individual speaker in Parliament pairs the two words together most 
frequently, and the individual speech in which those concepts are juxtaposed 
the most. Additional measures such as these may lend confidence that particu- 
lar sentences given as examples of collocation are not merely cherry-picked for 
their familiarity, but actually offered foundational instances of the making or 
re-use of the concept. 

More to the point, as historians engage in moving from the small example to 
the big question, and from the big overview back to individual speech acts, the 
process of movement itself is open to methodological argument, questions of 
interpretation and over-interpretation. It is important that historians begin to 
describe their choice of exemplary passages. Thick description of the process of 
extraction affords an opportunity to articulate the work of human interpreta- 
tion and machine contextualisation. As we examine this frontier, we will begin 
to understand better the application of distant reading to work on different 
scales, including the scale of the corpus, the author, the work, the paragraph 
and the word. 


Theorising the difference between AI and statistical measures 


In the work here, the terms ‘mutual information’ and ‘neural nets’ appear on 
the same page. Their basis could not be more different, however. The former 
is statistical and can be described, mathematically, at every step; the latter has 
been developed mainly from computer-science departments aiming to mirror 
human processes, and essentially represents a black box of pattern recognition. 
Some scholars in the computer sciences herald a day where autonomous intel- 
ligence will obviate human supervision in most domains, including education. 
Colleagues in other parts of the university are more skeptical, arguing that AI, 
in most cases, relies on the labour-intensive hand-tooling of research questions 
to algorithms. More precise and transparent answers, they suggest, are to be 
had from old-fashioned statistics. 

Humanists are far from the centre of these debates, but our testing of algo- 
rithms and our successes and failures have implications beyond our own 
discipline. As Eloranta and his colleagues point out, business and economic 
historians have a long history of critical engagement with statistical measures 
such as regression and event analysis that could easily contribute to a rigorous 
comparison between statistical measures and unsupervised machine learning. 
Elsewhere, in Chapter 16, Välimäki and his colleagues rely on neural networks 
(the epitome of unsupervised, black-box AI), while another advances a prefer- 
ence for mutual information: the epitome of advanced statistics, where equally 
useful clusters are formed on the basis of a relatively transparent clustering 
formula. Because we know our textual corpora and their historical context so 
well, as the scholars’ research on authorship illustrates, historians are often in 
a better position to ‘train’ and ‘test’ the scripts of AI and to comment on how 
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well they work. These findings could usefully be published in the Papers of the 
National Academy of Sciences or Science, to the edification of other disciplines 
and the credit of our own. 

To the degree to which historical methods border on questions of the success 
of AI or a preference for transparent statistics, historians have an opportunity 
to theorise about the stakes of one choice over another. As the discipline of digi- 
tal history becomes more advanced, we should expect more work of this kind. 


Transparent documentation of the choice of algorithm, text and result 
in the practice of critical search 


Elsewhere, I have argued that scholarly engagement, both traditional and 
digital, in general tends towards the critical examination of the choices that 
inform a research project." Whether the choices made in an archival visit 
are informed by the reading of critical theory, or whether political interest in 
the longue durée drives a scholar towards particular algorithms, critical think- 
ing about the motives and limits of particular kinds of research is always being 
called upon to inform the constraints of the research process. I generalised the 
digital research process into a pattern I called ‘critical search, and I character- 
ised major opportunities for critical thinking inherent in any research process. 
Instances of critical search in an article might include discussing the choice of 
keywords and algorithms and how initial choices reveal and conceal aspects 
of a corpus later disclosed by adjustments to the initial search. The point, in 
any case, is that these are forms of critical reflection that scholars are prone 
to in general, and through these critical reflections, the entire community of 
readers comes to consensus about the uses of particular algorithms, the multi- 
ple dimensions of digital archives and the interpretative questions that govern 
digital research. 

The chapters in this volume offer freely evolved examples of critical search, in 
the sense that they reflect upon the process of creating knowledge with digital 
tools. In their search for how 19th-century Austrian newspapers described the 
emergent principle of humanism, Hakkarainen and Iftikhar describe multiple 
iterations of different kinds of topic models, configurations, with or without 
stop words, and how they ultimately decided on a combination of topic models 
with a corpus little prepared for analysis. The process of curation, decision- 
making and interpretation is shown to be at the heart of scholarly digital work. 

In Chapter 12, Ihalainen and Sahala explain their persuasive use of informa- 
tion theory's concept of ‘mutual information to examine lexical change over 
time. They use pointwise mutual information (PMI) to identify the most regu- 
lar combinations of words used to describe the international, the global and 
the transnational in British parliamentary debates in the 20th century. Their 
method begins with computational work, but traces back digital findings to 
the text, excerpting compelling passages illustrating the rising tide of interwar 
optimism about international relations. 
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We should expect more critical engagement with the search process in the 
future. As a community, we are learning how to better highlight the distance 
between interpretive work and computational work in each research process. 
In making the choices behind an algorithmic deployment transparent, digital 
scholars acknowledge that an algorithm isn't a toaster oven, into which a neo- 
phyte puts texts in order to achieve an automatic result. Rather, the process of 
curation, critical inquiry, secondary reading and interpretation remain at the 
heart of scholarly inquiry. 


Engagement with new standards of scholarship from the institutions of 
historical and cultural knowledge-making 


Several of the chapters in this volume implicitly call for a deeper level of par- 
ticipation by historians and their national and international societies in follow- 
ing standards of data preservation, sharing and transparency. In Chapter 10, 
Maiju Kannisto and Pekka Kauppinen offer a detailed description of copyright 
issues that had to be overcome for data analysis and sharing in the domain of 
contemporary media analysis. In Chapter 5, Jessica Parland-von Essen argues 
that the institutions of cultural analysis should take care to preserve, describe 
and make accessible multiple layers of data, including its mark-up, analyses, 
tools, descriptive metadata, consent, rights and attributions of labour. In the 
process, she describes a potential mountain of cultural labour to be executed by 
the libraries, archives and IT centres of the world. 

Parland-von Essens chapter suggests a precise charge to the meetings of 
national historical associations and other learned societies. All of our meetings 
should have not merely panels for presenting new work in the digital humani- 
ties, but also panels for discussing the standards of data presentation, annota- 
tion and interoperability. 

Cultural institutions (for instance, the Swedish and Finnish literary societies) 
have a particular role to play in setting out standards for data that is transparent 
andaccessible. Were they to engage the questions raised in this book, in meetings, 
pamphlets, conferences and hiring, they would have the opportunity to 
shape how the caretakers of data document the many kinds of labour 
that have shaped the collection, as well as how practising scholars indicate that 
they have used data with origins elsewhere. National and international historical 
meetings offer an important opportunity for inviting cultural institutions and 
providers of data, from our museums to the private Elseviers of the world, to 
cooperate with scholars in following these mandates. 

On the one hand, directives from national historical associations and confer- 
ences are profoundly needed. Even high-profile infrastructure initiatives such as 
Europeana do not currently provide workflows suited to historians’ needs. On 
the other hand, groups of historians concerned with these issues have already 
assembled over generations. Petri Paju’s Chapter 2 discovers a long-standing 
tradition for digital-historical institutions and collaborations: for example, the 
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Helsinki Corpus of English Texts with its origins in the 1980s, the Association 
for History and Computing, and the Electronic Center for History Research 
(later Agricola), launched in 1996 by scholars, librarians and archivists; 
and later the Helsinki Centre for Digital Humanities, or HELDIG, was estab- 
lished at the University of Helsinki in 2016. If these institutions have histori- 
cally had little influence over mainstream practice, then a new generation of 
national historical associations and other learned societies could usefully mine 
the affiliates list of the digital centres for faculties already invested in the impor- 
tant issues of data preservation, documentation and accessibility. 


International collaborations and tool sharing 


The beauty of computational research is its interoperability: once a technique 
has been discovered for author attribution in medieval Latin, it should work 
nearly as well on any early Latin texts whatsoever. The same is true for parlia- 
mentary discourse: the collocate machine described by Ihalainen and Sahala 
should apply to any modern body of text, and the particular tactics for engag- 
ing party-political differences of lexicon should be immediately applicable to 
digitised records of the French débats, the debates of the European Union, the 
Canadian and Australian Hansards and the debates of the City Council of New 
York City, to name a few active projects. Despite these opportunities, how- 
ever, there are relatively few examples of international collaborations that take 
advantage of the astonishing interoperability of algorithms by drawing together 
scholars working on similar genres of texts. 

One initiative that successfully crosses these boundaries is the Oceanic 
Exchanges project,? to which six nations (including Finland, represented by 
Hannu Salmi at Turun Yliopisto) contribute their historical newspapers and 
their technical expertise. In the case of newspapers, text-recognition technol- 
ogy applied in one nation can rapidly be adopted to newspapers elsewhere, 
so the international collaboration represents a massive virtuous cycle of 
exchanges. All of these efforts contribute to making the international infra- 
structure of future research, such that one day soon, we should expect that 
high-school students of history in Finland, America and Mexico will be able 
to keyword search the newspapers of their national traditions and compare 
debates about democracy and markets across nations over 200 years. 

It is harder to explain why there is not such an international collaboration 
for the novel, for debates of democratic bodies, for the records of courts of 
law, for religious texts, for plays, for musical notation, for stylometric author- 
ship attribution and for other genre-specific questions where scholars have 
similar questions related to form: comparing authors and chapters of the 
novel, for instance; comparing speakers, parties and constituencies in demo- 
cratic debate; comparing kinds of charges, prosecution and defence; or judges, 
juries and defendants in the courts of law. In each of these genres, an interna- 
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tional pooling of technical expertise could result in the rapid creation of new 
knowledge about best methods for turning raw visual images into readable text, 
for turning text into data annotated with appropriate metadata categories and 
for deriving meaning over time. The only bar against such collaborations is one 
of organisation, effort and collegiality: creating useful partnerships depends on 
planning for international visits, sharing plans of work and the willingness to 
engage in a process of mutual discernment about where grants and research 
projects overlap. 

The benefits of such collaborations are, of course, tremendous: they can result 
in the pooling of common goals and strategies, the most efficient use of grant 
money, the joint discovery of new methodologies and even the joint funding of 
new infrastructure that makes clean, accessible data available to all. 


Notes 


! Stilgoe 1982. 

? Arendt 1973 [1958]. 

? Risam 2019. 

^ Michel et al. 2011. 

5 Stone 1979. 

° Klingenstein, Hitchcock & DeDeo 2014; Barron et al. 2018; Kraicer & Piper 
2019. 

? Blair 2010; Murdock, Allen & DeDeo 2017. 

* Weeks 1982; Gay 1986; Schwartz 2001; Brown 2013. 

? Lemercier & Zalc 2019. 

10 Underwood 2019. 

" Piper 2019. 

12 Guldi 2018. 

? The website of the project is https://oceanicexchanges.org/. 
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